We formulate necessary and sufficient conditions for an arbitrary
discrete probability distribution to factor according to an undirected 
graphical model, or a log-linear model, or other more general expo- 
nential models. For decomposable graphical models these conditions 
are equivalent to a set of conditional independence statements simi- 
lar to the Hammersley-Clifford theorem; however, we show that for 
nondecomposable graphical models they are not. We also show that
nondecomposable models can have nonrational maximum likelihood
estimates. These results are used to give several novel characteriza- 
tions of decomposable graphical models. 



1. Introduction. Exponential models for discrete data have a long his- 
tory in statistics. In this paper, we take an algebraic approach to analyzing 
exponential models. Our starting point is to describe a class of exponential 
models for discrete distributions in terms of a polynomial mapping from 
a set of parameters to distributions. These models include two important 
well-known classes of models: the log-linear model and, an important type of 
log-linear model, the undirected graphical model. Representing the models 
as polynomials rather than in the more standard exponential representation 
allows us to use tools from computational algebraic geometry (e.g., [6]) to 
analyze the algebraic properties of these models. 

We begin by providing necessary and sufficient conditions for a discrete 
probability distribution to factor according to an undirected graphical model, 
or a log-linear model, or a more general exponential model. The factoriza- 
tion of distributions according to these classes of models is well studied (see, 
e.g., [2, 4, 5]). Unlike previous analyses that either assume positivity or use
the exponential representation, our use of a polynomial representation al- 
lows us to provide a uniform treatment of the factorization for both positive 
and nonpositive distributions. 

Next, utilizing computational tools and results from algebraic geome- 
try, we analyze constraints imposed on distributions by specific exponential 
models. More specifically, we transform parametrically defined models into 
implicit descriptions. The following example illustrates the concept of an 
implicit description. 

Example 1. Consider probability distributions over three binary vari- 
ables A,B,C defined parametrically as 

$$P(a,b,c) = p_{abc} \propto \psi_{AB}(a,b)\,\psi_{BC}(b,c).$$

This can be viewed as either a log-linear model with generators (AB) and
(BC) or as an undirected graphical model $A - B - C$. The corresponding
implicit description is given as follows. A probability distribution $P =
(p_{000}, p_{001}, p_{010}, p_{011}, p_{100}, p_{101}, p_{110}, p_{111})$ factors according to this model if
and only if

$$p_{001}\,p_{100} = p_{000}\,p_{101} \quad\text{and}\quad p_{011}\,p_{110} = p_{010}\,p_{111}.$$
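
The equivalence in Example 1 can be checked numerically in one direction. The following is a minimal sketch, not part of the original analysis: the function names and the randomly drawn potentials are illustrative assumptions. It builds a distribution from the parametric form and tests the two binomial equations.

```python
from itertools import product
import random

def chain_distribution(psi_ab, psi_bc):
    """Normalized table p[(a, b, c)] proportional to psi_ab[a, b] * psi_bc[b, c]."""
    weights = {(a, b, c): psi_ab[(a, b)] * psi_bc[(b, c)]
               for a, b, c in product((0, 1), repeat=3)}
    total = sum(weights.values())
    return {state: w / total for state, w in weights.items()}

def satisfies_implicit_description(p, tol=1e-12):
    """Check the two binomial equations of Example 1 up to a tolerance."""
    eq_b0 = p[(0, 0, 1)] * p[(1, 0, 0)] - p[(0, 0, 0)] * p[(1, 0, 1)]
    eq_b1 = p[(0, 1, 1)] * p[(1, 1, 0)] - p[(0, 1, 0)] * p[(1, 1, 1)]
    return abs(eq_b0) <= tol and abs(eq_b1) <= tol

# Hypothetical potentials, drawn at random purely for illustration.
rng = random.Random(0)
psi_ab = {key: rng.random() for key in product((0, 1), repeat=2)}
psi_bc = {key: rng.random() for key in product((0, 1), repeat=2)}
p = chain_distribution(psi_ab, psi_bc)

# A distribution that does not factor: extra mass on the state 000.
q = {state: 0.1 for state in product((0, 1), repeat=3)}
q[(0, 0, 0)] = 0.3

print(satisfies_implicit_description(p), satisfies_implicit_description(q))
```

The tolerance is needed only because the products are computed in floating point; with exact rational arithmetic the two equations hold exactly for `p`.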

Perhaps the most well-known example of implicit descriptions of statisti- 
cal models is given by the Hammersley-Clifford theorem (e.g., [3, 19]) which 
characterizes the factorization of strictly positive distributions with respect 
to undirected graphs. 

Our analysis provides insight into two distinct but related approaches 
that have been used to study many types of graphical models including 
undirected and directed graphical models. The first approach is to define 
graphical models by specifying a graph according to which a probability 
distribution must factor in order to belong to the graphical model. This ap- 
proach was emphasized, for example, by Darroch, Lauritzen and Speed [7]. 
The second approach is to define graphical models by specifying, through 
a graph, a set of conditional independence statements which a probability 
distribution must satisfy in order to belong to the graphical model. This di- 
rection was emphasized, for example, by Pearl [22] and Geiger and Pearl [13]. 
Lauritzen ([19], Chapter 3) compared these approaches and herein we ex- 
tend his analysis. Our analysis allows us to identify the difference in these 
approaches and provides several novel characterizations of decomposable 
graphical models. 

We note that using tools from computational algebra in the study of im- 
plicit descriptions of statistical models is not new. For instance, Settimi and 
Smith [26] and Geiger, Heckerman, King and Meek [12] analyze the geometric
structure of directed graphical models with and without latent variables
from the perspective of real algebraic geometry. In addition, Pistone, Ric- 
comagno and Wynn [23] have used commutative algebra to study what we 
term the binary four-cycle model (see Example 4) and Garcia, Stillman and 
Sturmfels [11] have used commutative algebra to study graphical models 
with latent variables. 

The paper is organized as follows. In Section 2 we define a class of exponential
models and describe log-linear and undirected graphical models. In
Section 3 we provide necessary and sufficient conditions for a discrete prob- 
ability distribution to factor according to such an exponential model or to 
be the limit of distributions that factor. In Section 4 we focus our attention 
on undirected graphical models. We demonstrate that every nondecompos- 
able model implies nonconditional-independence constraints, and show the 
possibility of nonrational maximum likelihood estimates for some nondecom- 
posable models. Our analysis is summarized in Theorem 4.4, which provides 
various characterizations of decomposable models. 

2. Exponential, log-linear and graphical models. Our objects of study 
are certain statistical models for a finite state space $\mathcal{X}$. We identify $\mathcal{X}$ with
the set $\{1,2,\dots,m\}$ and define a probability distribution over $\mathcal{X}$ to be a
vector $P = (p_1,\dots,p_m)$ in $\mathbb{R}^m_{\ge 0}$ such that $p_1 + \cdots + p_m = 1$.

The class of models to be considered consists of discrete probability distributions
defined via a $d \times m$ matrix $A = (a_{ij})$ of nonnegative integers. One
technical assumption we will make about the matrix A is that all its column
sums are equal, that is, $\sum_{i=1}^d a_{i1} = \sum_{i=1}^d a_{i2} = \cdots = \sum_{i=1}^d a_{im}$. We say that
a probability distribution P belongs to model A if and only if P is in the
image of the monomial mapping $\phi_A$ which takes nonnegative real d-vectors
to nonnegative real m-vectors:

(2.1) $$\phi_A : \mathbb{R}^d_{\ge 0} \to \mathbb{R}^m_{\ge 0}, \quad (t_1,\dots,t_d) \mapsto \Big(\prod_{i=1}^d t_i^{a_{i1}},\ \prod_{i=1}^d t_i^{a_{i2}},\ \dots,\ \prod_{i=1}^d t_i^{a_{im}}\Big),$$

where, as we do throughout the paper, we adopt the convention that $t^0 = 1$
for $t \ge 0$. When P belongs to model A, we also say that P factors according
to model A. The models described by (2.1) are usually described in the
statistical literature as exponential families (models) of the form

(2.2) $$P_\theta(x) = Z(\theta)\, e^{\langle \theta, T(x) \rangle}, \quad \theta \in [-\infty,\infty)^d,$$

where $x \in \mathcal{X}$, $Z(\theta)$ is a normalizing constant, $\langle \cdot, \cdot \rangle$ denotes an inner product
and sufficient statistics $T : \mathcal{X} \to \mathbb{Z}^d \setminus \{0\}$ where $\mathbb{Z}$ denotes the set of integers
and 0 is a vector of d zeroes.
The classes of models in (2.2) and (2.1) are identical. Note that each column
of A corresponds to a different state x of $\mathcal{X}$. Thus, a model defined
by (2.2) with sufficient statistics $T(x)$ is equivalent to a model defined by
(2.1) with matrix A if and only if the columns $a_j$ of A coincide with the
corresponding $T(x)$. For a particular distribution from a model of the form
(2.2) with sufficient statistics $T(x)$ and parameters $\theta$, the corresponding
parameters $t_i$ for the corresponding model in (2.1) are $t_i = \exp(\theta_i)$ where
$\exp(-\infty) = 0$. We describe the models given by (2.2) in terms of a polynomial
map because, as we shall see, this description allows us to use commutative
algebra to provide algebraic descriptions of interesting properties of
these models. Unlike the exponential representation, the use of a polynomial
map allows us to directly analyze distributions that do not have full support.
Recall that the support of an m-dimensional vector v is the set of indices
$\operatorname{supp}(v) = \{ i \in \{1,\dots,m\} : v_i \neq 0 \}$.
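
The correspondence $t_i = \exp(\theta_i)$ between (2.1) and (2.2) can be illustrated with a short sketch. The function names and the small example matrix below are assumptions made for illustration, and the normalizing constant $Z(\theta)$ is omitted, so both functions return unnormalized values.

```python
import math

def monomial_map(A, t):
    """phi_A(t): coordinate j is prod_i t[i] ** A[i][j] (Python gives 0 ** 0 == 1)."""
    d, m = len(A), len(A[0])
    return [math.prod(t[i] ** A[i][j] for i in range(d)) for j in range(m)]

def exponential_form(A, theta):
    """Unnormalized exp(<theta, T(x_j)>), where T(x_j) is column j of A.

    Entries of theta may be -inf, which corresponds to t_i = 0.
    """
    d, m = len(A), len(A[0])
    values = []
    for j in range(m):
        # Rows with a zero entry are skipped so that -inf * 0 never arises.
        exponent = sum(theta[i] * A[i][j] for i in range(d) if A[i][j] != 0)
        values.append(math.exp(exponent))
    return values

# Illustrative matrix with equal column sums (two binary factors, independence model).
A = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]
t = [0.5, 0.0, 2.0, 3.0]
theta = [math.log(ti) if ti > 0 else -math.inf for ti in t]

print(monomial_map(A, t))
print(exponential_form(A, theta))
```

The zero parameter `t[1] = 0.0` produces a distribution without full support; in the monomial form it is an ordinary input, while in the exponential form it has to be encoded as $\theta_2 = -\infty$.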

This class of models includes log-linear and undirected graphical models
used in the analysis of multiway contingency tables. When analyzing multiway
contingency tables, the state space is a product space $\mathcal{X} = \prod_{X_j \in X} I_{X_j}$
where $X = \{X_1,\dots,X_n\}$ is a set of (random) variables, called factors, and
$I_{X_j}$ is the set of levels (or states) for the factor $X_j$. A log-linear model is
defined by a collection $Q = \{Q_1,\dots,Q_m\}$ of subsets of X. We refer to the
$Q_i$ as the generators of the log-linear model. A log-linear model for a set of
generators Q is defined as

$$P(x) \propto \prod_{i=1}^m \psi_{Q_i}(x),$$

where $x \in \mathcal{X}$ is an instantiation of the variables in X and $\psi_{Q_i}(x)$ is a potential
function that depends on x only through the values of the variables in $Q_i$.
This log-linear model can be represented in the following way by a matrix
A as in (2.1). The columns of A are indexed by $\mathcal{X} = \prod_{X_j \in X} I_{X_j}$. The rows
of A are indexed by pairs consisting of a generator $Q_i$ and an element of
$\prod_{X_j \in Q_i} I_{X_j}$. All entries of A are either zero or 1. The entry is 1 if and only if
the element in the row index is equal to the projection of the column index
to the factors in the generator of the row index.
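
The construction of A from a set of generators can be sketched as follows. This is an illustrative implementation under assumed conventions (factor names as strings, columns in lexicographic order of the states, rows grouped by generator); the paper does not prescribe a particular ordering.

```python
from itertools import product

def design_matrix(levels, generators):
    """Build the 0/1 matrix A of a log-linear model.

    levels: dict mapping each factor name to its list of levels.
    generators: list of tuples of factor names.
    Returns (row_labels, column_labels, A).
    """
    factors = list(levels)
    columns = list(product(*(levels[f] for f in factors)))
    rows = [(gen, combo)
            for gen in generators
            for combo in product(*(levels[f] for f in gen))]
    A = []
    for gen, combo in rows:
        positions = [factors.index(f) for f in gen]
        # Entry is 1 exactly when the column state projects onto the row's levels.
        A.append([1 if tuple(col[k] for k in positions) == combo else 0
                  for col in columns])
    return rows, columns, A

# Three binary factors with generators {X1, X2} and {X2, X3}.
levels = {"X1": [0, 1], "X2": [0, 1], "X3": [0, 1]}
rows, columns, A = design_matrix(levels, [("X1", "X2"), ("X2", "X3")])
for label, row in zip(rows, A):
    print(label, row)
```

Every column of the resulting matrix contains exactly one 1 per generator, so the column sums are all equal to the number of generators, which is the technical assumption made on A.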

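Example 2. The no-three-way interaction model for binary factors $X_1$,
$X_2$, $X_3$ has generators $Q = \{\{X_1, X_2\}, \{X_2, X_3\}, \{X_1, X_3\}\}$ and is represented
by the matrix (columns ordered $p_{000}, p_{001}, \dots, p_{111}$; rows ordered $t_1, \dots, t_{12}$)

$$A = \begin{pmatrix}
1&1&0&0&0&0&0&0\\
0&0&1&1&0&0&0&0\\
0&0&0&0&1&1&0&0\\
0&0&0&0&0&0&1&1\\
1&0&0&0&1&0&0&0\\
0&1&0&0&0&1&0&0\\
0&0&1&0&0&0&1&0\\
0&0&0&1&0&0&0&1\\
1&0&1&0&0&0&0&0\\
0&1&0&1&0&0&0&0\\
0&0&0&0&1&0&1&0\\
0&0&0&0&0&1&0&1
\end{pmatrix}.$$

A probability distribution $P = (p_{000}, p_{001}, p_{010}, p_{011}, p_{100}, p_{101}, p_{110}, p_{111})$ factors
in the no-three-way interaction model if and only if it lies in the image
of the associated monomial mapping

(2.3) $$\phi_A : \mathbb{R}^{12}_{\ge 0} \to \mathbb{R}^{8}_{\ge 0}, \quad (t_1,\dots,t_{12}) \mapsto (t_1t_5t_9,\ t_1t_6t_{10},\ t_2t_7t_9,\ t_2t_8t_{10},\ t_3t_5t_{11},\ t_3t_6t_{12},\ t_4t_7t_{11},\ t_4t_8t_{12}).$$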

An important type of log-linear model is the undirected graphical model 
[19]. Such a model is specified by an undirected graph G with vertex set 
X and edge set E. The undirected graphical model for the graph G is the 
log-linear model in which the generators are the cliques (maximal complete 
subgraphs) of the undirected graph G. The matrix A of (2.1) is a function 
of the graph G and we write it as A(G). Example 2 shows a log-linear model 
that is not graphical. 

Example 3. The three-variable-chain undirected graphical model with
graph G equal to $X_1 - X_2 - X_3$ has generators $Q = \{\{X_1, X_2\}, \{X_2, X_3\}\}$.
When each $X_i$ is a binary variable, the matrix A(G) is identical to the first
eight rows of the matrix of Example 2.

An undirected graphical model is said to be a decomposable graphical 
model if and only if the graph G is chordal — that is, if every cycle of length 
4 or more has a chord. The undirected graphical model given in Example 3 
is a decomposable graphical model. We conclude this section with the four- 
cycle undirected graphical model, the simplest nondecomposable graphical 
model. This model will be examined in detail in Section 4.4. 



Example 4. The four-cycle undirected graphical model for binary variables
with graph G having four edges $X_1 - X_2$, $X_2 - X_3$, $X_3 - X_4$ and
$X_1 - X_4$ has generators $Q = \{\{X_1,X_2\},\{X_2,X_3\},\{X_3,X_4\},\{X_1,X_4\}\}$ and
is represented by a $16 \times 16$ zero-one matrix A(G), with one row for each pair
consisting of a generator and one of its four joint levels, and one column for
each of the sixteen states.
3. Exponential models and toric varieties. In this section, we study the
algebraic structure of the exponential models in (2.1). We provide a char- 
acterization of those distributions that factor according to a model and of 
those distributions that are the limit of distributions that factor. Finally, 
we describe how one can use tools from commutative algebra to obtain a 
complete description of the set of distributions that factor according to a 
model of the form (2.1) or are the limit of distributions that factor in terms 
of polynomial equations not involving model parameters. 

3.1. Distributions that factor and limits of distributions that factor. We 
formulate necessary and sufficient conditions for a probability distribution 
to factor according to a matrix A and for a distribution to be the limit of 
distributions that factor according to a matrix A. 

The factorization of distributions according to exponential models is well 
studied (e.g., [2, 4, 5, 24]). Typically the analysis of these models is carried 
out using the exponential form given in (2.2). This type of analysis leads to 
the treatment of nonpositive distributions as special limiting cases such as 
the "boundaries at infinity" of Cencov [5]. The factorization of distributions
according to log-linear models is also well studied (e.g., [8, 14, 15]). These 
analyses provide characterizations of factorization but only for positive dis- 
tributions. By utilizing the product form representation of (2.1), we provide 
a uniform treatment of the factorization for both positive and nonpositive 
distributions. 
Another alternative to our approach and to using an exponential representation
is given in Lauritzen's [18] development of a generalization of exponential
models called general exponential models. The general exponential
model treats sufficient statistics as values in a commutative semigroup and
replaces the exponential function with the members of a dual semigroup
defined in terms of a homomorphism from the semigroup of sufficient statistics
to the semigroup $(\mathbb{R}_{\ge 0}, \cdot)$. In certain examples, this approach yields a
uniform treatment of positive and nonpositive distributions.

The characterization of factorization for distributions of the form of (2.1)
is provided in terms of a condition on the support of the distribution and a
set of algebraic constraints. We begin with the condition on the support of
the distribution. Let $a_j = (a_{1j},\dots,a_{dj})$ denote the jth column vector of the
$d \times m$ matrix A. Note that $\operatorname{supp}(a_j) \subseteq \{1,2,\dots,d\}$. A subset F of $\{1,\dots,m\}$
is said to be A-feasible if, for every $j \in \{1,\dots,m\} \setminus F$, the support $\operatorname{supp}(a_j)$
of the vector $a_j$ is not contained in $\bigcup_{l \in F} \operatorname{supp}(a_l)$. Note that, trivially, the
set $\{1,\dots,m\}$ is A-feasible.
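
The definition of A-feasibility translates directly into a set computation. The sketch below is illustrative only: the function names are assumptions, and column indices are 0-based rather than 1-based as in the text.

```python
def support(vector):
    """Indices of the nonzero entries of a vector."""
    return {i for i, value in enumerate(vector) if value != 0}

def is_feasible(A, F):
    """True if the set F of (0-based) column indices is A-feasible."""
    m = len(A[0])

    def column(j):
        return [row[j] for row in A]

    # Union of the supports of the columns indexed by F.
    covered = set().union(*(support(column(l)) for l in F))
    # No column outside F may have its support inside that union.
    return all(not support(column(j)) <= covered
               for j in range(m) if j not in F)

# Illustrative matrix: two binary factors, generators {X1} and {X2};
# columns are ordered 00, 01, 10, 11.
A = [[1, 1, 0, 0],
     [0, 0, 1, 1],
     [1, 0, 1, 0],
     [0, 1, 0, 1]]

print(is_feasible(A, {0, 1}))   # support {00, 01}
print(is_feasible(A, {0, 3}))   # support {00, 11}
```

In this small example the support {00, 11} is not feasible: positive mass on both 00 and 11 forces every parameter to be positive, so the states 01 and 10 must receive positive mass as well.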



Lemma 1. A probability distribution P factors according to A only if 
the support of P is A-feasible. 



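Proof. Let P be a probability distribution which factors according to
A, that is, $P \in \operatorname{image}(\phi_A)$. We must show that $F = \operatorname{supp}(P)$ is A-feasible.
Let $(t_1,\dots,t_d)$ be any preimage of P under $\phi_A$. Then

(3.1) $$p_j = \prod_{i=1}^d t_i^{a_{ij}} \ \begin{cases} > 0, & \text{for } j \in F, \\ = 0, & \text{for } j \notin F. \end{cases}$$

Suppose that F is not A-feasible. Then $\operatorname{supp}(a_k)$ lies in $\bigcup_{l \in F} \operatorname{supp}(a_l)$ for
some $k \notin F$. Consequently for every $i \in \operatorname{supp}(a_k)$, there exists an $l \in F$
such that $a_{il} > 0$. Hence, due to (3.1), $t_i > 0$ for every $i \in \operatorname{supp}(a_k)$. Thus
$p_k = \prod_{i \in \operatorname{supp}(a_k)} t_i^{a_{ik}} > 0$, contrary to our assumption that $k \notin F$. □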



Next we turn to the algebraic condition. The nonnegative toric variety
$X_A$ is the set of all vectors $(x_1,\dots,x_m) \in \mathbb{R}^m_{\ge 0}$ which satisfy

(3.2) $$x_1^{u_1} x_2^{u_2} \cdots x_m^{u_m} = x_1^{v_1} x_2^{v_2} \cdots x_m^{v_m}$$

whenever $u = (u_1,\dots,u_m)$ and $v = (v_1,\dots,v_m)$ are vectors of nonnegative
integers which satisfy the d linear relations

(3.3) $$u_1 a_1 + u_2 a_2 + \cdots + u_m a_m = v_1 a_1 + v_2 a_2 + \cdots + v_m a_m.$$
Note that (3.3) merely states that $u - v$ is in the kernel of the matrix A, that
is, the matrix A times the column vector $u - v$ is zero. Since the exponents
$u_1,\dots,u_m,v_1,\dots,v_m$ used in (3.2) were assumed to be integers, the set $X_A$
is indeed an algebraic variety, that is, the zero set of a system of polynomial
equations.
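
The passage from a kernel vector to a binomial constraint can be sketched in code. The helper names below are hypothetical. An integer vector w in the kernel of A is split into its positive part u and negative part v, which gives one pair satisfying (3.3); any other pair with the same difference adds a common vector c to both u and v, and the corresponding equation (3.2) follows by multiplying both sides by $x^c$.

```python
import math

def in_kernel(A, w):
    """True if the integer vector w satisfies A w = 0."""
    return all(sum(a * c for a, c in zip(row, w)) == 0 for row in A)

def binomial_residual(w, x):
    """Value of x^u - x^v, where u and v are the positive and negative parts of w."""
    u = [max(c, 0) for c in w]
    v = [max(-c, 0) for c in w]
    lhs = math.prod(xj ** uj for xj, uj in zip(x, u))
    rhs = math.prod(xj ** vj for xj, vj in zip(x, v))
    return lhs - rhs

# Matrix of the binary three-variable chain (generators {X1, X2} and {X2, X3});
# columns are ordered 000, 001, ..., 111.
A_chain = [
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 1, 0, 0, 0, 1],
]

# Kernel vector encoding p001 * p100 - p000 * p101.
w = [-1, 1, 0, 0, 1, -1, 0, 0]
uniform = [1 / 8] * 8
point_mass_mix = [0.5, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0]

print(in_kernel(A_chain, w))
print(binomial_residual(w, uniform))
print(binomial_residual(w, point_mass_mix))
```

The uniform distribution satisfies the constraint, while the distribution with mass 1/2 on each of 000 and 101 violates it, so the latter does not lie in $X_A$ for this matrix.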

Lemma 2. A probability distribution P factors according to A only if P
lies in the nonnegative toric variety $X_A$.

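Proof. We need to show that the image of $\phi_A$ is a subset of $X_A$. Indeed,
suppose that $x = (x_1,\dots,x_m) \in \operatorname{image}(\phi_A)$. There exist nonnegative reals
$t_1,\dots,t_d$ such that $x_j = t_1^{a_{1j}} t_2^{a_{2j}} \cdots t_d^{a_{dj}}$ for $j = 1,\dots,m$. This implies that
(3.2) has the form

$$\Big(\prod_{i=1}^d t_i^{a_{i1}}\Big)^{u_1} \Big(\prod_{i=1}^d t_i^{a_{i2}}\Big)^{u_2} \cdots \Big(\prod_{i=1}^d t_i^{a_{im}}\Big)^{u_m}
= \Big(\prod_{i=1}^d t_i^{a_{i1}}\Big)^{v_1} \Big(\prod_{i=1}^d t_i^{a_{i2}}\Big)^{v_2} \cdots \Big(\prod_{i=1}^d t_i^{a_{im}}\Big)^{v_m}$$

and hence it holds whenever (3.3) holds. Thus, x lies in $X_A$. □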

The following theorem provides a characterization of distributions that 
factor in terms of these two conditions. 

Theorem 3.1. A probability distribution P factors according to A if
and only if P lies in the nonnegative toric variety $X_A$ and the support of P
is A-feasible.

The only-if direction has been proved in Lemmas 1 and 2. The if direction 
is provided in the Appendix. 

We now turn our discussion to the set of distributions that do not factor
but are the limit of distributions that factor. In general, $\operatorname{image}(\phi_A)$ is not
a closed subset of the orthant $\mathbb{R}^m_{\ge 0}$. This is important because if there are
distributions that do not factor according to a model but are the limit of
distributions that do factor, then there is no unique maximum likelihood
estimate (MLE) for some data sets. See Section 4 for an analysis of such
phenomena in the four-cycle undirected graphical model. We will see in
Theorem 4.4 that $\operatorname{image}(\phi_A)$ is closed for an undirected graphical model if
and only if the model is decomposable.

Our next theorem says that the set of probability distributions which lie
in the toric variety $X_A$ coincides with those in the closure of the image of
$\phi_A$; that is, $X_A = \operatorname{closure}(\operatorname{image}(\phi_A))$. Note that the closure can be taken
in either the usual metric topology or in the Zariski topology because the
closures of an image of a polynomial map taken in these topologies are the
same. This result means that $P \in X_A$ if and only if P factors according to
A, or P is the limit of probability distributions which factor according to
A. The set of distributions in $X_A$, when A consists only of zeroes and 1's,
is called an extended log-linear model by Lauritzen [19]. Thus Theorem 3.2
below amounts to an algebraic description of extended exponential models
and, thus, extended log-linear models and extended undirected graphical
models.

Theorem 3.2. A probability distribution P factors according to A or is
the limit of probability distributions that factor according to A if and only if
P lies in the nonnegative toric variety $X_A$.

The proof of Theorem 3.2 is provided in the Appendix. 

Theorems 3.1 and 3.2 together characterize probability distributions in
$X_A \setminus \operatorname{image}(\phi_A)$, namely, distributions that do not factor but are the limit
of distributions that do factor. These distributions are those that lie in $X_A$
but have a support which is not A-feasible.

3.2. Describing exponential models by binomial equations. In this sec- 
tion we describe an implicit representation of the toric variety that contains 
the distributions that factor according to the model in (2.1). The implicit 
representation is given in terms of the common zero set of a finite list of poly- 
nomial equations. These polynomial equations are interesting from both an 
algorithmic and a theoretical point of view in that they describe constraints 
on probability distributions that must hold for any distribution that factors 
according to the model. 

Implicit representations of statistical models (see Example 1) are naturally 
described using the language of ideals and varieties. We briefly review these 
basic concepts from algebra and refer the reader to an excellent text by Cox, 
Little and O'Shea [6] for more details. All algebra terminology we use which 
is not defined in this paper can be found in [6]. 

We work in the ring $\mathbb{R}[x] = \mathbb{R}[x_1,\dots,x_m]$ of polynomials with real coefficients
in the indeterminates $x_1,\dots,x_m$. An ideal I is a nonempty subset of
$\mathbb{R}[x]$ which satisfies two properties: (1) if $q_1, q_2 \in I$, then $q_1 + q_2 \in I$, and (2)
if $b \in \mathbb{R}[x]$ and $q \in I$, then $bq \in I$. With every ideal I in $\mathbb{R}[x]$ we associate
a set of varieties,

$$X_I^K = \{ x \in K^m : q(x) = 0 \text{ for every } q \in I \},$$

where K denotes either the positive real numbers $\mathbb{R}_{>0}$ or the nonnegative
real numbers $\mathbb{R}_{\ge 0}$. To simplify the notation, we write $X^{>0}$ and $X$ rather
than $X^{\mathbb{R}_{>0}}$ and $X^{\mathbb{R}_{\ge 0}}$, respectively, and drop the explicit reference to the
ideal when the associated ideal is clear from context. For $x \in K^m$, testing
$x \in X^K$ is equivalent to checking that $q(x) = 0$ for all $q \in I$. In the analysis
of statistical models $K = \mathbb{R}_{\ge 0}$ corresponds to the set of (nonnegative) probability
distributions, and $K = \mathbb{R}_{>0}$ corresponds to the set of strictly positive
probability distributions.

The task of checking that a point (e.g., a distribution) is in the zero set
of each of the polynomials in an ideal appears extremely hard, but there
are two fundamental results which make it more tractable. Hilbert's basis
theorem states that every ideal in $\mathbb{R}[x]$ is finitely generated, namely, every
ideal I in $\mathbb{R}[x]$ contains a finite subset $\{g_1,\dots,g_n\}$, called an ideal basis of
I, such that every $q \in I$ can be written as $q(x) = \sum_{i=1}^n b_i(x) g_i(x)$ where the $b_i$
are polynomials in $\mathbb{R}[x]$. Consequently, a point x in $K^m$ lies in $X^K$ if and
only if $g_1(x) = \cdots = g_n(x) = 0$. The ideal generated by a set of polynomials
$g = \{g_1,\dots,g_n\}$ is denoted by $\langle g_1,\dots,g_n \rangle$. The second fundamental result, by
Buchberger, is an algorithm that produces a distinguished ideal basis, called
a Gröbner basis, for any given ideal I. An ideal basis g for I is a Gröbner
basis for I in some term order (say lexicographical, or reverse lexicographical
order) if the set of highest-ordered terms of the polynomials in g generates
the ideal generated by the highest-order terms of all polynomials in I.

The important property of Gröbner bases is that they allow one to check,
in an efficient manner, whether a polynomial constraint belongs to an ideal.
For example, if one obtains a small Gröbner basis for a graphical model
under study, then one can use it to answer whether any cross product ratio, 
or any other polynomial constraint, must hold in that model. The focus on 
studying the ideals rather than the associated varieties also stems from the 
complexities introduced by allowing probability distributions that are not 
strictly positive. 

In this paper, we consider ideals generated by a set of polynomials each
having precisely two terms. Such polynomials are sometimes called binomials.
The toric ideal $I_A$ associated with a $d \times m$ integer matrix A is generated
by the binomials $x_1^{u_1} \cdots x_m^{u_m} - x_1^{v_1} \cdots x_m^{v_m}$ satisfying (3.3). A variety
corresponding to a toric ideal is called a toric variety. An introduction to toric
ideals can be found in [28]. We can now rewrite Theorems 3.1 and 3.2 as
follows.

Theorem 3.3. A probability distribution P factors according to an exponential
model A if and only if the support of P is A-feasible and all polynomials
in an ideal basis of the toric ideal $I_A$ vanish at P.

Theorem 3.4. A probability distribution P is the limit of probability
distributions that factor according to A if and only if all polynomials in an
ideal basis of the toric ideal $I_A$ vanish at P.
We call these the factorization theorem and the limit factorization theorem,
respectively. Thus, if we know a small ideal basis for $I_A$, then we can
efficiently test whether or not a distribution P lies in $X_A$ by checking that
P satisfies these polynomials. It is important to note that it is frequently
possible to replace the ideal basis g of an ideal $I = \langle g_1,\dots,g_n \rangle$ with a smaller
basis g' for an ideal J such that the variety $X_J$ agrees with the variety $X_I$ in
the nonnegative orthant (i.e., $X_I^{\ge 0} = X_J^{\ge 0}$). Thus, one can often identify
smaller sets than the ideal basis for $I_A$ for use in Theorems 3.3 and 3.4 when
testing a distribution. We demonstrate that, even in the case of decomposable
models, the ideal basis for $I_A$ is typically larger than an ideal basis for
an ideal whose zero set defines $X_A$. For arbitrary undirected graphical models,
the Hammersley-Clifford theorem, to be discussed in the next section,
defines a small subset of binomials whose zero set defines $X^{>0}_{A(G)}$, while it is
an open problem to describe the Gröbner basis for an arbitrary undirected
graphical model G.

4. Algebraic analysis of graphical models. The algebraic tools developed 
in Section 3 will now be applied to the undirected graphical models A(G). 
We compare and contrast the Hammersley-Clifford theorem (e.g., [19], page 
36; [3]) with the factorization Theorem 3.3 and limit factorization Theorem 
3.4. We investigate the form of the ideal bases for decomposable and nonde- 
composable models. We also study the algebraic complexity of the maximum 
likelihood estimator for undirected graphical models. Our main result is a 
characterization of decomposable graphical models in terms of their ideal 
basis, the rationality of maximum likelihood estimates, and whether the 
model contains all of its limit points. 

4.1. Quadratic polynomials representing conditional independence. The 
set of probability distributions that satisfy a conditional independence state- 
ment can be regarded as an algebraic variety. In this subsection we explain 
how to derive the defining ideal of such a variety. The ideal basis will con- 
sist of certain quadratic polynomials which we call cross-product differences 
(CPDs). Given three discrete random variables X, Y, Z, we define

(4.1) $$\mathrm{cpd}(X = \{x,x'\}, Y = \{y,y'\} \mid Z = z) = P(x,y,z)P(x',y',z) - P(x',y,z)P(x,y',z),$$

where x and x' are levels of X and y and y' are levels of Y and z is a level
of Z. Note that cross-product differences are closely related to cross-product
ratios (CPRs); the CPR is defined as follows:

(4.2) $$\mathrm{cpr}(X = \{x,x'\}, Y = \{y,y'\} \mid Z = z) = \frac{P(x,y,z)P(x',y',z)}{P(x',y,z)P(x,y',z)}.$$
Cross-product ratios (also called conditional odds-ratios) are a fundamental
measure of association and interaction and are often used to interpret the
parameters of a log-linear model (see, e.g., [1]). A CPD and the corresponding
CPR constraint are identical in the sense that

$$\mathrm{cpd}(X = \{x,x'\}, Y = \{y,y'\} \mid Z = z) = 0 \quad\text{if and only if}\quad \mathrm{cpr}(X = \{x,x'\}, Y = \{y,y'\} \mid Z = z) = 1,$$

provided the denominators in (4.2) are nonzero. We prefer CPD constraints
to avoid dividing by zero for nonpositive distributions. However, when
interpreting higher-degree binomials in the toric ideal of an undirected graphical
model, it is convenient to describe the constraints in terms of cross-product
ratios which can then be converted into binomial constraints by clearing the
denominator as above. To simplify notation, when X and Y each represent
a single binary variable, we write

(4.3) $$\mathrm{cpr}(X, Y \mid Z = z) = \frac{P(x,y,z)P(x',y',z)}{P(x',y,z)P(x,y',z)}.$$

Let $X_1,\dots,X_n$ denote discrete variables, where $I_{X_j}$ is the set of levels
of the variable $X_j$. We fix the polynomial ring $\mathbb{R}[\mathcal{X}]$ whose indeterminates
are elementary probabilities $p_{a_1 a_2 \cdots a_n}$ which are indexed by the elements of
$\mathcal{X} = I_{X_1} \times I_{X_2} \times \cdots \times I_{X_n}$. Conditional independence statements have the
form

(4.4) X is independent of Y given Z,

where X, Y and Z are pairwise disjoint subsets of $\{X_1,\dots,X_n\}$. The statement
(4.4) translates into a large set of CPDs of the form (4.1). Namely,
we take $\mathrm{cpd}(X = \{x,x'\}, Y = \{y,y'\} \mid Z = z)$, where x, x' runs over distinct
elements in $\prod_{X_i \in X} I_{X_i}$, where y, y' runs over distinct elements in $\prod_{X_j \in Y} I_{X_j}$,
and where z runs over $\prod_{X_k \in Z} I_{X_k}$. Note that some of these CPDs may be
redundant.
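
The translation of a conditional independence statement into CPDs can be sketched as follows. The function names and the example potentials are assumptions made for illustration. Each $P(x,y,z)$ is computed by marginalizing the elementary probabilities, and only unordered pairs of distinct levels are enumerated, since swapping x and x' merely changes the sign of the CPD.

```python
from itertools import combinations, product

def marginal(p, names, assignment):
    """Sum p over all states agreeing with the partial assignment (dict name -> level)."""
    return sum(prob for state, prob in p.items()
               if all(state[names.index(n)] == v for n, v in assignment.items()))

def cpds(p, names, levels, X, Y, Z):
    """Values of the cross-product differences (4.1) for 'X independent of Y given Z'."""
    def configs(group):
        return [dict(zip(group, combo))
                for combo in product(*(levels[n] for n in group))]

    values = []
    for z in configs(Z):
        for x, x2 in combinations(configs(X), 2):
            for y, y2 in combinations(configs(Y), 2):
                def P(a, b):
                    return marginal(p, names, {**a, **b, **z})
                values.append(P(x, y) * P(x2, y2) - P(x2, y) * P(x, y2))
    return values

# Hypothetical chain distribution over binary A - B - C, as in Example 1.
names = ["A", "B", "C"]
levels = {n: [0, 1] for n in names}
psi_ab = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
psi_bc = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 3.0}
weights = {s: psi_ab[(s[0], s[1])] * psi_bc[(s[1], s[2])]
           for s in product([0, 1], repeat=3)}
total = sum(weights.values())
p = {s: w / total for s, w in weights.items()}

saturated = cpds(p, names, levels, ["A"], ["C"], ["B"])   # A indep. of C given B
unsaturated = cpds(p, names, levels, ["A"], ["B"], [])    # A indep. of B (marginally)
print(saturated, unsaturated)
```

For this distribution the two CPDs of the saturated statement vanish up to rounding, while the CPD of the unsaturated statement, which requires summing over C, does not.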

Each probability P(x,y,z) occurring in the CPDs of a conditional independence
statement is obtained by marginalizing over all of the elementary
probabilities $p_{a_1 a_2 \cdots a_n}$ for which the indices agree with x, y and z. This means
that the probability P(x,y,z) is a polynomial of degree 1 in $\mathbb{R}[\mathcal{X}]$. The linearity
of probabilities and the form of the CPD in (4.1) lead to the following
remark.

Remark 1. Conditional independence statements translate into a sys- 
tem of CPDs that correspond to quadratic polynomials. 

The conditional independence statement (4.4) is said to be saturated if
$X \cup Y \cup Z = \{X_1,\dots,X_n\}$. The fact that probabilities in the CPDs associated
with a saturated conditional independence statement do not require
marginalization leads to the following remark.
Remark 2. Saturated conditional independence statements about the
variables $X_1,\dots,X_n$ translate into a set of quadratic binomials.

4.2. Undirected graphical models and the Hammersley-Clifford theorem. 
In this section, we describe sets of conditional independence statements 
derived from separation statements in undirected graphs and their con- 
nection to the factorization of distributions. Of particular interest is the 
Hammersley-Clifford theorem that relates the factorization of a strictly pos- 
itive distribution P according to an undirected graphical model to a set of 
conditional independence statements that must hold in P. We describe the 
Hammersley-Clifford theorem in the language of ideals and varieties and 
compare it to our factorization theorem. 

Let G be an undirected graphical model with variables $\{X_1,\dots,X_n\}$ as before.
We define $I_{\mathrm{pairwise}(G)}$ to be the ideal in $\mathbb{R}[\mathcal{X}]$ generated by the quadratic
binomials corresponding to all the saturated conditional independence statements

(4.5) $X_i$ is independent of $X_j$ given $\{X_1,\dots,X_n\} \setminus \{X_i, X_j\}$,

where $(X_i, X_j)$ runs over all nonedges of the graph G. Note that (4.5) is saturated,
so the polynomials arising from the construction in the previous section
are indeed binomials. The ideal $I_{\mathrm{pairwise}(G)}$ defines a variety $X^K_{\mathrm{pairwise}(G)}$
where K can be either $\mathbb{R}_{\ge 0}$ or $\mathbb{R}_{>0}$. When $K = \mathbb{R}_{\ge 0}$, the superscript of X
is dropped.

The pairwise Markov property is discussed in Section 3.2.1 of [19]. Lauritzen
uses the notation $M_P(G)$ to denote the variety $X_{\mathrm{pairwise}(G)}$. We will
also need the (saturated) global Markov property. This is described in our
language as follows. We define $I_{\mathrm{global}(G)}$ to be the ideal in $\mathbb{R}[\mathcal{X}]$ generated
by the quadratic binomials corresponding to all the saturated conditional
independence statements (4.4) where Z separates X from Y in the graph
G. (The term global Markov is often used to describe the set of conditional
independence statements that follow from all separation statements rather
than only the saturated separation statements. The fact that only the saturated
statements are needed follows from simple properties of undirected
graphs and conditional independence. The required conditional independence
properties are properties C1 and C2 of [19], page 29. The required
graph property is that any unsaturated separation statement in a graph is
implied by a saturated separation fact also true in the graph.) This separation
condition means that every path from a vertex in X to a vertex in Y
must pass through some vertex in Z. The ideal $I_{\mathrm{global}(G)}$ defines a variety
$X^K_{\mathrm{global}(G)}$ where K is either $\mathbb{R}_{\ge 0}$ or $\mathbb{R}_{>0}$. When $K = \mathbb{R}_{\ge 0}$, the superscript
of X is dropped. Lauritzen [19] states the following three inclusions, which
hold for every graph G. Each of the following three inclusions can be strict:

(4.6) $$\operatorname{image}(\phi_{A(G)}) \subseteq X_{A(G)} \subseteq X_{\mathrm{global}(G)} \subseteq X_{\mathrm{pairwise}(G)}.$$
The following example provides an illustration of quadratic polynomials
generating $I_{\mathrm{pairwise}(G)}$.

Example 5. Consider the four-cycle undirected graphical model of
Example 4. This graph has four maximal cliques, one for each edge. The
probability distributions defined by this model have the form

\[
P(x_1, x_2, x_3, x_4) \propto \psi_{\{1,2\}}(x_1, x_2)\,\psi_{\{2,3\}}(x_2, x_3)\,\psi_{\{3,4\}}(x_3, x_4)\,\psi_{\{1,4\}}(x_1, x_4). \tag{4.7}
\]

If all four variables are binary, then the pairwise ideal is

\[
\begin{array}{rll}
I_{\mathrm{pairwise}(G)} = \langle & p_{1011}p_{1110} - p_{1010}p_{1111}, & p_{0111}p_{1101} - p_{0101}p_{1111}, \\
 & p_{1001}p_{1100} - p_{1000}p_{1101}, & p_{0110}p_{1100} - p_{0100}p_{1110}, \\
 & p_{0011}p_{0110} - p_{0010}p_{0111}, & p_{0011}p_{1001} - p_{0001}p_{1011}, \\
 & p_{0001}p_{0100} - p_{0000}p_{0101}, & p_{0010}p_{1000} - p_{0000}p_{1010} \ \rangle.
\end{array} \tag{4.8}
\]

This is a binomial ideal in a polynomial ring in sixteen indeterminates:

\[
I_{\mathrm{pairwise}(G)} \subset \mathbb{R}[X] = \mathbb{R}[p_{0000}, p_{0001}, p_{0010}, \ldots, p_{1111}].
\]

The left column of four binomials in (4.8) represents the statement "$X_2$ is
independent of $X_4$ given $\{X_1, X_3\}$" and the right column of four
binomials in (4.8) represents the statement "$X_1$ is independent of $X_3$
given $\{X_2, X_4\}$." The variety $X_{\mathrm{pairwise}(G)}$ is the set of all
points in $K^{16}$ which are common zeros of these eight binomials. Note that
$I_{\mathrm{pairwise}(G)} = I_{\mathrm{global}(G)}$ for the four-cycle model
and therefore, for this model, the right inclusion of (4.6) is an
equality.
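
The eight generators in (4.8) can be enumerated mechanically. The following is a minimal sketch, not the authors' code: the function name `pairwise_binomials` and the 0-based vertex numbering are assumptions made for illustration. For every nonadjacent pair and every value of the remaining variables it emits the binomial $p_{01}p_{10} - p_{00}p_{11}$ in the two free coordinates.

```python
from itertools import product

def pairwise_binomials(num_vars, edges):
    """Quadratic binomials of the saturated pairwise Markov statements.

    Each result ((s, t), (u, v)) stands for p_s * p_t - p_u * p_v, where the
    states are written as bit strings. Variables are binary and 0-based.
    """
    edge_set = {frozenset(e) for e in edges}
    binomials = []
    for i in range(num_vars):
        for j in range(i + 1, num_vars):
            if frozenset((i, j)) in edge_set:
                continue  # adjacent pairs give no independence statement
            rest = [k for k in range(num_vars) if k not in (i, j)]
            for values in product((0, 1), repeat=len(rest)):
                def state(xi, xj):
                    x = [0] * num_vars
                    for k, val in zip(rest, values):
                        x[k] = val
                    x[i], x[j] = xi, xj
                    return "".join(map(str, x))
                binomials.append(((state(0, 1), state(1, 0)),
                                  (state(0, 0), state(1, 1))))
    return binomials

# Four-cycle X1 - X2 - X3 - X4 - X1, with vertices numbered from 0 in the code.
four_cycle = [(0, 1), (1, 2), (2, 3), (0, 3)]
gens = pairwise_binomials(4, four_cycle)
for (s, t), (u, v) in gens:
    print(f"p{s}*p{t} - p{u}*p{v}")
```

The same helper applies to any binary undirected graph; only the edge list changes.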

The following well-known theorem (e.g., [19], page 36) relates the ideal of 
pairwise conditional independence statements and factorization. 

Theorem 4.1 (Hammersley-Clifford). Let G be an undirected graphical
model. A strictly positive probability distribution P factors according to
A(G) if and only if P is in the variety $X^{>0}_{\mathrm{pairwise}(G)}$; that
is, $X^{>0}_{A(G)} = X^{>0}_{\mathrm{pairwise}(G)}$.

Our factorization Theorem 3.3 generalizes the Hammersley-Clifford theorem
in two respects. First, it does not require the probability distribution
P to be strictly positive. Second, it does not require the model represented
by matrix A to be an undirected graphical model. The main advantage of
the Hammersley-Clifford theorem over the factorization theorem is
computational. That is, the set $I_{\mathrm{pairwise}(G)}$ is easily described
in terms of the structure of the graph while one must usually resort to a
symbolic algebra program to produce an ideal basis or a Gröbner basis for
$I_A$.

The proof of the Hammersley-Clifford theorem given in [19] actually
establishes the following slightly stronger result: any integer vector in the
kernel of the matrix $A(G)$ is an integer linear combination of the vectors
$u - v$ corresponding to the binomials $p^u - p^v$ arising from the
conditional independence statements for the nonadjacent pairs $(X_i, X_j)$
in G. Translating this statement from the additive notation into
multiplicative notation, we obtain the following:

A binomial $p^u - p^v$ lies in the toric ideal $I_{A(G)}$ of an undirected
graphical model $A(G)$ if and only if some monomial multiple of it, that is,
a binomial of the form $p^{u+w} - p^{v+w}$, lies in
$I_{\mathrm{pairwise}(G)}$.

This fact is important for computational purposes. It means that we can
use the quadratic binomials in $I_{\mathrm{pairwise}(G)}$ as input when
computing the toric ideal $I_{A(G)}$ by Algorithm 12.3 of [28].

4.3. Decomposable models. In this section, we discuss factorization and 
ideal bases for the variety of probability distributions corresponding to de- 
composable graphical models. 

Theorem 4.2 ([19], Proposition 3.19). Let G be a decomposable graphical
model. A probability distribution P factors according to A(G) if and only
if P is in $X_{\mathrm{global}(G)}$.

This theorem is analogous to the Hammersley-Clifford theorem in that it 
provides an implicit description of distributions that factor according to a de- 
composable graph in terms of conditional independence statements. Unlike 
the Hammersley-Clifford theorem, this theorem is not restricted to positive 
distributions. An immediate corollary to this theorem is the following: if P 
is a limit of probability distributions that factor according to A(G), then P 
itself factors according to A(G). This implies that the support of a distribu- 
tion P need not be tested in order to decide whether P factors according to 
a decomposable model. Furthermore, for a decomposable graphical model 
G, two of the inclusions in (4.6) are equalities, 

\[
\operatorname{image}(\phi_{A(G)}) = X_{A(G)} = X_{\mathrm{global}(G)} \subseteq X_{\mathrm{pairwise}(G)}, \tag{4.9}
\]

but the inclusion on the right-hand side is generally strict. The two equalities 
on the left are equivalent to Theorem 4.2. We shall see in Example 6 below 
that the inclusion on the right is strict for the four-chain model. 

Not every toric ideal $I_A$ which is generated by quadratic binomials has
a Gröbner basis consisting of quadratic binomials (see, e.g., [28]). It turns
out that toric ideals arising from decomposable graphical models are well
behaved in this regard.

Theorem 4.3. Let G be a decomposable graphical model. Then the set
of quadratic binomials representing CPDs for saturated conditional
independence statements for G forms a Gröbner basis of the toric ideal
$I_{A(G)}$.

A nice proof of this theorem was given by Hoşten and Sullivant [17].
Their Theorem 4.17 explicitly constructs a minimal (and reduced) Gröbner
basis for an arbitrary decomposable graphical model. From an algebraic
point of view, we note that the equation
$X^{K}_{A(G)} = X^{K}_{\mathrm{global}(G)}$ for chordal graphs G holds not
just for $K = \mathbb{R}_{\ge 0}$, but also for $K = \mathbb{R}$ and for
$K = \mathbb{C}$. Takken [29] and Dobra [10] proved that this equality holds
in the ideal-theoretic sense, namely, that
$I_{A(G)} = I_{\mathrm{global}(G)}$. This means that the CPDs (quadratic
binomials) representing global conditional independence statements contain
an ideal basis for the toric ideal $I_{A(G)}$ where G is decomposable.

Some of the statistical implications of this result are explicated in
Diaconis and Sturmfels [9]. They showed that every minimal ideal basis of the
toric ideal $I_A$ provides a set of moves for a Markov chain Monte Carlo
approach to sampling from the conditional distribution of data given
sufficient statistics for discrete exponential families of the form (2.1).
They showed that a minimal Gröbner basis guarantees that the resulting Markov
chain is connected, and that no proper subset of such an ideal basis has this
property.

We complete this section with an example that illustrates that the right- 
most subset relation in (4.9) is strict and the fact that the Hammersley- 
Clifford theorem fails for nonpositive graphical models even when the models 
are decomposable. 

Example 6. Consider the chain model $G_4$ for four binary variables
$X_1 - X_2 - X_3 - X_4$. The ideal representing the pairwise Markov property
is generated by twelve quadratic binomials:

\[
\begin{array}{rll}
I_{\mathrm{pairwise}(G_4)} = \langle & p_{0010}p_{1000} - p_{0000}p_{1010}, & p_{0001}p_{1000} - p_{0000}p_{1001}, \\
 & p_{0001}p_{0100} - p_{0000}p_{0101}, & p_{0011}p_{1001} - p_{0001}p_{1011}, \\
 & p_{0011}p_{1010} - p_{0010}p_{1011}, & p_{0011}p_{0110} - p_{0010}p_{0111}, \\
 & p_{0110}p_{1100} - p_{0100}p_{1110}, & p_{0101}p_{1100} - p_{0100}p_{1101}, \\
 & p_{1001}p_{1100} - p_{1000}p_{1101}, & p_{0111}p_{1101} - p_{0101}p_{1111}, \\
 & p_{0111}p_{1110} - p_{0110}p_{1111}, & p_{1011}p_{1110} - p_{1010}p_{1111} \ \rangle.
\end{array}
\]

There are many probability distributions which show that the inclusion in
(4.9) is strict for this example. For instance, take
$p_{0010} = p_{1111} = 1/2$ and all other 14 indeterminates zero. The twelve
ideal generators of $I_{\mathrm{pairwise}(G_4)}$ all vanish at this
distribution but the binomial
$p_{0011}p_{1110} - p_{0010}p_{1111} \in I_{A(G_4)}$ that is implied by the
independence of $X_4$ and $\{X_1, X_2\}$ given $X_3$ does not.
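
The claim in Example 6 can be checked with exact arithmetic. This is an illustrative sketch only: the helper `with_values` and the 0-based indexing are assumptions, not part of the paper. It evaluates the twelve pairwise binomials of the chain and the one global binomial at the distribution $p_{0010} = p_{1111} = 1/2$.

```python
from fractions import Fraction
from itertools import combinations, product

chain_edges = {(0, 1), (1, 2), (2, 3)}  # X1 - X2 - X3 - X4, 0-based
states = list(product((0, 1), repeat=4))

# p0010 = p1111 = 1/2, every other cell probability is zero.
p = {s: Fraction(0) for s in states}
p[(0, 0, 1, 0)] = p[(1, 1, 1, 1)] = Fraction(1, 2)

def with_values(base, i, xi, j, xj):
    """Copy of `base` with coordinate i set to xi and coordinate j to xj."""
    s = list(base)
    s[i], s[j] = xi, xj
    return tuple(s)

pairwise_values = []
for i, j in combinations(range(4), 2):
    if (i, j) in chain_edges:
        continue  # only nonadjacent pairs contribute
    for base in states:
        if base[i] or base[j]:
            continue  # one representative per value of the other variables
        value = (p[with_values(base, i, 0, j, 1)] * p[with_values(base, i, 1, j, 0)]
                 - p[with_values(base, i, 0, j, 0)] * p[with_values(base, i, 1, j, 1)])
        pairwise_values.append(value)

# Binomial implied by "X4 independent of {X1, X2} given X3".
global_value = p[(0, 0, 1, 1)] * p[(1, 1, 1, 0)] - p[(0, 0, 1, 0)] * p[(1, 1, 1, 1)]

print(len(pairwise_values), all(v == 0 for v in pairwise_values), global_value)
```

All twelve pairwise values are zero because the two support points differ in three coordinates, whereas every monomial of a pairwise binomial multiplies two states that differ in exactly two coordinates.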
4.4. Nondecomposable models. We now discuss nondecomposable undirected
graphical models from the perspective of the factorization Theorem
3.3 and study the implicit description of distributions that factor in terms
of the ideal bases for the toric ideal $I_{A(G)}$. These ideal bases contain
polynomials which do not correspond to conditional independence statements.
We explicitly describe the nonconditional-independence polynomials for the
four-cycle model and demonstrate that the degree of the polynomial
constraints describing factorization can grow exponentially in the number of
variables.

Probability distributions which factor according to the four-cycle model 
of Example 4 must satisfy not just the eight quadratic binomials in (4.8), 
which arise from pairwise conditional independence statements, but they 
must satisfy certain additional polynomials of degree 4 listed in (4.10). 



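Proposition 1. Consider the four-cycle undirected graphical model of
Example 4 with graph G'. A probability distribution P factors according to
the four-cycle or is the limit of probability distributions that factor
according to the four-cycle if and only if P satisfies the following ideal
basis of the toric ideal $I_{A(G')}$:

\[
I_{A(G')} = I_{\mathrm{pairwise}(G')} + \langle f^{\mathrm{diff}}_{12}, f^{\mathrm{diff}}_{23}, f^{\mathrm{diff}}_{34}, f^{\mathrm{diff}}_{14}, f^{\mathrm{same}}_{12}, f^{\mathrm{same}}_{23}, f^{\mathrm{same}}_{34}, f^{\mathrm{same}}_{14} \rangle,
\]

where

\[
\begin{aligned}
f^{\mathrm{diff}}_{12} &= p_{0100}p_{0111}p_{1001}p_{1010} - p_{0101}p_{0110}p_{1000}p_{1011}, \\
f^{\mathrm{diff}}_{23} &= p_{0010}p_{0101}p_{1011}p_{1100} - p_{0011}p_{0100}p_{1010}p_{1101}, \\
f^{\mathrm{diff}}_{34} &= p_{0001}p_{0110}p_{1010}p_{1101} - p_{0010}p_{0101}p_{1001}p_{1110}, \\
f^{\mathrm{diff}}_{14} &= p_{0001}p_{0111}p_{1010}p_{1100} - p_{0011}p_{0101}p_{1000}p_{1110}, \\
f^{\mathrm{same}}_{12} &= p_{0000}p_{0011}p_{1101}p_{1110} - p_{0001}p_{0010}p_{1100}p_{1111}, \\
f^{\mathrm{same}}_{23} &= p_{0000}p_{0111}p_{1001}p_{1110} - p_{0001}p_{0110}p_{1000}p_{1111}, \\
f^{\mathrm{same}}_{34} &= p_{0000}p_{0111}p_{1011}p_{1100} - p_{0011}p_{0100}p_{1000}p_{1111}, \\
f^{\mathrm{same}}_{14} &= p_{0000}p_{0110}p_{1011}p_{1101} - p_{0010}p_{0100}p_{1001}p_{1111}.
\end{aligned} \tag{4.10}
\]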

The basis given in this proposition is obtained from Algorithm 12.3 of [28]
using the eight quadratic generators of $I_{\mathrm{pairwise}(G')}$ and the
polynomial map $\phi_{A(G')}$.
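
Membership of the eight quartics in the toric ideal can be sanity-checked without a computer algebra system, using the standard fact that a binomial $p^u - p^v$ lies in $I_A$ exactly when $Au = Av$. The sketch below assumes, for illustration, that the rows of $A(G')$ are indexed by an edge of the four-cycle together with a pair of values on that edge; the names `QUARTICS` and `edge_margins` are hypothetical.

```python
from collections import Counter

EDGES = [(0, 1), (1, 2), (2, 3), (0, 3)]  # four-cycle, 0-based vertices

# Each entry lists the states of the two monomials of a quartic in (4.10).
QUARTICS = {
    "f12_diff": ("0100 0111 1001 1010", "0101 0110 1000 1011"),
    "f23_diff": ("0010 0101 1011 1100", "0011 0100 1010 1101"),
    "f34_diff": ("0001 0110 1010 1101", "0010 0101 1001 1110"),
    "f14_diff": ("0001 0111 1010 1100", "0011 0101 1000 1110"),
    "f12_same": ("0000 0011 1101 1110", "0001 0010 1100 1111"),
    "f23_same": ("0000 0111 1001 1110", "0001 0110 1000 1111"),
    "f34_same": ("0000 0111 1011 1100", "0011 0100 1000 1111"),
    "f14_same": ("0000 0110 1011 1101", "0010 0100 1001 1111"),
}

def edge_margins(states):
    """Sum of the columns of A(G') indexed by `states`, as a multiset of
    (edge, value on first endpoint, value on second endpoint) rows."""
    return Counter((e, s[e[0]], s[e[1]]) for s in states for e in EDGES)

# A binomial p^u - p^v is in the toric ideal iff both monomials have the
# same edge margins, that is, iff A u = A v.
in_ideal = {name: edge_margins(u.split()) == edge_margins(v.split())
            for name, (u, v) in QUARTICS.items()}
print(in_ideal)
```

This confirms membership only; it does not show that the sixteen binomials generate the ideal, which is what Algorithm 12.3 of [28] establishes.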

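Next we provide an interpretation of the ideal basis of the four-cycle given
in Proposition 1. The basis prescribed by (4.10) can be described in terms
of a ratio of cross-product ratios. In particular, using the definition of CPR
in (4.3), the eight new basis elements (4.10) can be written as follows:

\[
\begin{aligned}
\operatorname{cpr}(X_3, X_4 \mid X_1X_2 = 01) \,/\, \operatorname{cpr}(X_3, X_4 \mid X_1X_2 = 10) &= 1, \\
\operatorname{cpr}(X_1, X_4 \mid X_2X_3 = 01) \,/\, \operatorname{cpr}(X_1, X_4 \mid X_2X_3 = 10) &= 1, \\
\operatorname{cpr}(X_1, X_2 \mid X_3X_4 = 01) \,/\, \operatorname{cpr}(X_1, X_2 \mid X_3X_4 = 10) &= 1, \\
\operatorname{cpr}(X_2, X_3 \mid X_1X_4 = 01) \,/\, \operatorname{cpr}(X_2, X_3 \mid X_1X_4 = 10) &= 1, \\
\operatorname{cpr}(X_3, X_4 \mid X_1X_2 = 00) \,/\, \operatorname{cpr}(X_3, X_4 \mid X_1X_2 = 11) &= 1, \\
\operatorname{cpr}(X_1, X_4 \mid X_2X_3 = 00) \,/\, \operatorname{cpr}(X_1, X_4 \mid X_2X_3 = 11) &= 1, \\
\operatorname{cpr}(X_1, X_2 \mid X_3X_4 = 00) \,/\, \operatorname{cpr}(X_1, X_2 \mid X_3X_4 = 11) &= 1, \\
\operatorname{cpr}(X_2, X_3 \mid X_1X_4 = 00) \,/\, \operatorname{cpr}(X_2, X_3 \mid X_1X_4 = 11) &= 1.
\end{aligned} \tag{4.11}
\]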

These constraints force the association between adjacent variables in the 
four-cycle to be identical for various values of the remaining variables. Thus, 
these constraints, when written as polynomials rather than ratios of poly- 
nomials, are restricting higher-order interactions, but, surprisingly, are only 
needed for the characterization of nonpositive distributions. 

Proposition 1 provides an ideal basis for the four-cycle undirected graph- 
ical model; however, the problem of explicitly providing a basis for an arbi- 
trary undirected graphical model remains open. 

We note that there is no general upper bound for the degrees of the 
binomials in the ideal basis of an undirected graphical model. For instance, if 
each variable in the four-cycle model has p levels, then there exists a minimal 
generator of degree >p. Such a binomial can be derived from Proposition 
14.14 in [28]. The next proposition demonstrates that the maximal degree 
of the polynomials in the ideal basis is unbounded when the complexity of 
the model increases even when all variables remain binary. 

Proposition 2. There exists an undirected graphical model for 2n binary
variables $X_1, \ldots, X_{2n}$ whose ideal basis contains a binomial of
degree $2^n$.

Proof. Let G be the undirected graphical model whose only nonedges
are $\{X_i, X_{i+n}\}$ for $i = 1, 2, \ldots, n$. Thus this model represents
n pairs of noninteracting binary variables. Let $p^u$ denote the product of
all indeterminates $p_{i_1 i_2 \cdots i_{2n}}$ such that
$i_1 = i_3 = i_5 = \cdots = i_{2n-1}$ and $i_1$ has the same parity as
$i_2 + i_4 + i_6 + \cdots + i_{2n}$, and let $p^v$ denote the product of all
indeterminates $p_{i_1 i_2 \cdots i_{2n}}$ such that
$i_1 = i_3 = i_5 = \cdots = i_{2n-1}$ and $i_1$ has parity different from
$i_2 + i_4 + i_6 + \cdots + i_{2n}$. Then $p^u - p^v$ is a binomial of degree
$2^n$ which lies in the toric ideal $I_{A(G)}$. It can be checked, for
instance using Corollary 12.13 in [28], that $p^u - p^v$ is a minimal
generator of $I_{A(G)}$. □

The undirected graphical models in the previous proof provide an
interesting family for further study. Note that for n = 2 this is precisely
the four-cycle model, and for n = 3 this is the edge graph of the octahedron,
with cliques {1,2,3}, {1,2,6}, {1,3,5}, {1,5,6}, {2,3,4}, {2,4,6}, {3,4,5},
{4,5,6}. Here the binomial constructed in the proof of Proposition 2 equals

\[
\begin{aligned}
p^u - p^v ={} & p_{000000}p_{000101}p_{010001}p_{010100}p_{101011}p_{101110}p_{111010}p_{111111} \\
 & - p_{000001}p_{000100}p_{010000}p_{010101}p_{101010}p_{101111}p_{111011}p_{111110},
\end{aligned}
\]

which can also be written as a ratio of ratios of CPRs.
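
For the octahedron (n = 3) the construction of $p^u$ and $p^v$ is small enough to replay directly. The sketch below is an illustration under stated assumptions: the clique list is the one quoted in the text, and `monomial_states` and `clique_margins` are hypothetical helper names. It checks only that $p^u - p^v$ lies in the toric ideal (equal clique margins); it does not check that the binomial is a minimal generator.

```python
from collections import Counter
from itertools import product

n = 3
# Cliques of the octahedron graph as listed in the text (1-based vertices).
cliques = [(1, 2, 3), (1, 2, 6), (1, 3, 5), (1, 5, 6),
           (2, 3, 4), (2, 4, 6), (3, 4, 5), (4, 5, 6)]

def monomial_states(same_parity):
    """States i_1...i_2n with i_1 = i_3 = ... = i_(2n-1) whose even-position
    sum has the same (or the opposite) parity as i_1."""
    chosen = []
    for x in product((0, 1), repeat=2 * n):
        odd_positions = x[0::2]   # i_1, i_3, i_5
        even_positions = x[1::2]  # i_2, i_4, i_6
        if len(set(odd_positions)) != 1:
            continue
        agrees = (sum(even_positions) - odd_positions[0]) % 2 == 0
        if agrees == same_parity:
            chosen.append(x)
    return chosen

def clique_margins(states):
    """Multiset of (clique, values on the clique) over the given states."""
    return Counter((c, tuple(s[v - 1] for v in c)) for s in states for c in cliques)

U = monomial_states(True)    # exponent vector u of p^u
V = monomial_states(False)   # exponent vector v of p^v
print(len(U), len(V), clique_margins(U) == clique_margins(V))
```

Both monomials have degree $2^3 = 8$, and the states in `U` are exactly the eight index strings of the first monomial displayed above.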

4.5. Variety differences for the four-cycle model. We focus on the fol- 
lowing relationships which hold for every undirected graphical model [see 
(4.6)]: 

\[
\operatorname{image}(\phi_{A(G)}) \subseteq X_{A(G)} = \operatorname{closure}(\operatorname{image}(\phi_{A(G)})) \subseteq X_{\mathrm{global}(G)}.
\]

Lauritzen showed via examples that both inclusions are strict for the
four-cycle model, in contrast to (4.9) for decomposable models. In this
section, we continue our algebraic analysis of the four-cycle model, studying
the set differences
$X_{A(G)} \setminus \operatorname{image}(\phi_{A(G)})$ and
$X_{\mathrm{global}(G)} \setminus X_{A(G)}$. The examples
considered herein are used in the proof of our characterization theorem of
decomposable models (Theorem 4.4).

The distributions that lie in
$X_{A(G)} \setminus \operatorname{image}(\phi_{A(G)})$ are those that have a
support which is not A-feasible. The following example from [19], page 37,
illustrates such a distribution and is due to Moussouris [21].

Example 7. Consider the probability distribution over four binary
variables $X_1, X_2, X_3, X_4$ where

\[
p_{0000} = p_{0001} = p_{1000} = p_{0011} = p_{1100} = p_{0111} = p_{1110} = p_{1111} = 1/8. \tag{4.12}
\]

This distribution satisfies all 16 binomial generators of $I_{A(G')}$ where
$A(G')$ is the 16 x 16 matrix in Example 4, and hence lies in the toric
variety $X_{A(G')}$. However, this distribution does not factor according to
the four-cycle because the support is not A-feasible. This can be seen from
the matrix $A(G')$: if F is the set of eight column indices appearing in
(4.12), then $\bigcup_{i \in F} \operatorname{supp}(a_i)$
consists of all 16 row indices of $A(G')$.
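
The support computation at the end of Example 7 can be reproduced in a few lines. This sketch assumes, since Example 4 is not repeated here, that the 16 rows of $A(G')$ are indexed by an edge of the four-cycle together with a pair of values on that edge, and that the column of a state has a 1 in exactly the four rows it is consistent with. The names `ROWS` and `column_support` are illustrative.

```python
EDGES = [(0, 1), (1, 2), (2, 3), (0, 3)]  # four-cycle, 0-based vertices

# Assumed row index set of A(G'): 4 edges x 4 value pairs = 16 rows.
ROWS = [(e, a, b) for e in EDGES for a in (0, 1) for b in (0, 1)]

def column_support(state):
    """Rows in which the column of A(G') indexed by `state` is nonzero."""
    return {(e, state[e[0]], state[e[1]]) for e in EDGES}

# The eight cells with positive probability in (4.12).
F = ["0000", "0001", "1000", "0011", "1100", "0111", "1110", "1111"]

covered = set()
for s in F:
    covered |= column_support(tuple(int(ch) for ch in s))

print(len(covered), covered == set(ROWS))
```

Because the union already contains every row, the support of each of the eight remaining columns is contained in it, which is what rules out A-feasibility of F.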

Because the distribution (4.12) is in $X_{A(G')}$, we know that it is the
limit of distributions that factor. Lauritzen proves this by writing it
explicitly as a limit of distributions that factor according to G'. This
example highlights the importance of being A-feasible when it comes to
factorization, and it illustrates our characterization of distributions that
do not factor but are the limit of distributions that do factor, which is
provided by Theorems 3.1 and 3.2.

We now discuss the set difference
$X_{\mathrm{global}(G)} \setminus X_{A(G)}$ of distributions that
satisfy the global Markov property but are not limits of distributions that
factor.

Example 8. Let P be the distribution over four binary variables in
which

\[
p_{0100} = p_{0111} = p_{1001} = p_{1010} = 1/4.
\]

This distribution satisfies the global Markov property for the four-cycle
[i.e., it lies in $X_{\mathrm{global}(G')}$]. However, it is not the limit of
distributions that factor [i.e., it does not lie in $X_{A(G')}$ because
$f^{\mathrm{diff}}_{12}(P) \neq 0$]. Note that the
other 15 generators of $I_{A(G')}$ vanish at P.
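
Example 8 can be verified by direct evaluation. The sketch below is illustrative: the helper `binomial` is hypothetical, and the eight quadrics are the pairwise generators of the four-cycle written as pairs of monomials. It evaluates them and the quartic $f^{\mathrm{diff}}_{12}$ at P with exact rational arithmetic.

```python
from fractions import Fraction
from itertools import product

states = ["".join(map(str, s)) for s in product((0, 1), repeat=4)]
p = dict.fromkeys(states, Fraction(0))
for s in ("0100", "0111", "1001", "1010"):
    p[s] = Fraction(1, 4)

def binomial(u, v):
    """Value of p^u - p^v, with each monomial given as a list of states."""
    left = right = Fraction(1)
    for s in u.split():
        left *= p[s]
    for s in v.split():
        right *= p[s]
    return left - right

# The eight pairwise (equivalently, global) quadrics of the four-cycle.
quadrics = [
    ("1011 1110", "1010 1111"), ("0111 1101", "0101 1111"),
    ("1001 1100", "1000 1101"), ("0110 1100", "0100 1110"),
    ("0011 0110", "0010 0111"), ("0011 1001", "0001 1011"),
    ("0001 0100", "0000 0101"), ("0010 1000", "0000 1010"),
]
quadric_values = [binomial(u, v) for u, v in quadrics]

# The quartic generator that P violates.
f12_diff = binomial("0100 0111 1001 1010", "0101 0110 1000 1011")
print(all(v == 0 for v in quadric_values), f12_diff)
```

The first monomial of $f^{\mathrm{diff}}_{12}$ is exactly the product of the four positive cells, so its value is $(1/4)^4$, while every other monomial involved contains a zero cell.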

We note that probability distributions with these properties in the
four-cycle models (for ternary variables) were found by Matúš and Studený
[20]. Also see Example 3.15 in [19], page 41. This example also demonstrates
an immediate corollary of Proposition 1 and the limit factorization
Theorem 3.4: the set
$X_{\mathrm{pairwise}(G')} \setminus X_{A(G')}$ contains all probability
distributions that do not factor and are not the limit of distributions that
factor but satisfy the pairwise conditional independence statements of the
Hammersley-Clifford theorem.

Finally, our contribution to the study of this four-cycle model is to provide
a completely general algebraic method for describing the set
$X_{\mathrm{global}(G')} \setminus X_{A(G')}$. This set consists of all
distributions in $X_{\mathrm{global}(G')}$ that violate at least one
polynomial in Proposition 1. For this claim to hold we need to show that
none of the 16 generators listed in Proposition 1 for the four-cycle model
is redundant in the limit factorization Theorem 3.4, that is, for any of
these 16 binomials in the ideal basis of $I_{A(G')}$ there exists a
probability distribution which satisfies the other 15 binomials but does not
lie in $X_{A(G')}$. Example 8 provides one such distribution and others can
be constructed in an analogous fashion.

4.6. Maximum likelihood estimation. In this section, we consider the 
problem of maximum likelihood estimation for undirected graphical models. 
One of the nice properties of decomposable models is that the maximum 
likelihood estimates are provided by a simple ratio of counts (see, e.g., [19], 
page 91). We demonstrate that the situation with nondecomposable models 
and the general exponential models of Section 3 is not so nice. 

For nondecomposable models the maximum likelihood estimate need not
exist for the model as defined by (2.1). This can be seen by considering the
problem of finding the maximum likelihood estimate (MLE) for the four-cycle
undirected graphical model with four binary variables when given a
data set with the empirical distribution given in Example 7. As we have
seen, the support of this distribution is not A-feasible and, thus, the
distribution cannot be parameterized as described in Example 4. Given that
the distribution satisfies the generators of the ideal basis, we know that
the distribution lies in the closure of the set of distributions
parameterized in Example 4. This example demonstrates that the MLE can fail
to exist. In particular, if the empirical distribution is in the boundary of
the model but cannot be factored according to the model, then the MLE will
fail to exist.

It is natural to extend the model, as described in Section 3, to include 
the distributions that are limits of distributions that factor according to the 
model. For the remainder of the paper, we consider only extended undirected 
graphical models and extended log-linear models. This approach was used 
by Lauritzen ([19], Chapter 4) to demonstrate that the MLE always exists 
for extended log-linear models. As noted in Section 3, the toric variety for 
the model and its ideal provide algebraic descriptions of extended log-linear 
models. In fact, one can compute the MLE by using a purely algebraic 
approach by (1) parameterizing the model with cell counts, (2) forcing the 
set of polynomial generators in the ideal basis to be equal to zero, and 
(3) forcing the set of marginal counts for each of the possible values for the 
cliques of the undirected graphical model (or, more generally, the generators 
of the log-linear model) to match the sum of the associated cell counts. The 
MLE is the unique real-valued nonnegative solution to this set of polynomial
equations (see, e.g., [19]).

Framing the problem of identifying the MLE as an algebraic problem 
allows the use of algebraic tools to analyze properties of the MLE for non- 
decomposable models. In the remainder of this section we use algebraic 
methods from Galois theory to demonstrate that the MLE for a nondecom- 
posable model is not necessarily rational and that one cannot generally write 
the MLE for nondecomposable models in closed form. 

Consider the four-cycle model for four binary variables (Example 4). We
present the maximum likelihood estimation for this model in full detail for
one explicit nontrivial data set, namely,

\[
\begin{pmatrix}
m_{0000} & m_{0001} & m_{0010} & m_{0011} \\
m_{0100} & m_{0101} & m_{0110} & m_{0111} \\
m_{1000} & m_{1001} & m_{1010} & m_{1011} \\
m_{1100} & m_{1101} & m_{1110} & m_{1111}
\end{pmatrix}
=
\begin{pmatrix}
1 & 1 & 1 & 1 \\
1 & 1 & 1 & 1 \\
1 & 1 & 1 & 2 \\
0 & 0 & 0 & 0
\end{pmatrix}, \tag{4.13}
\]

where $m_{ijkl}$ is the count of cases in which $X_1 = i$, $X_2 = j$,
$X_3 = k$ and $X_4 = l$. The maximum likelihood estimate for our data set is
a solution to a system of algebraic equations in 16 indeterminates
$\hat m_{ijkl}$. The last four coordinates of the maximum likelihood estimate
will automatically be zero,

\[
\hat m_{1100} = \hat m_{1101} = \hat m_{1110} = \hat m_{1111} = 0,
\]

since the last row sum is a sufficient statistic. We are hence left with a
system of equations in twelve indeterminates which we call the simplified
four-cycle model. This system consists of five binomials and eight linear
equations. The following binomials are the minimal generators of the toric
ideal of the simplified four-cycle model:

\[
\begin{aligned}
 & \hat m_{0011}\hat m_{1001} - \hat m_{0001}\hat m_{1011} \\
={} & \hat m_{0011}\hat m_{0110} - \hat m_{0010}\hat m_{0111} \\
={} & \hat m_{0001}\hat m_{0100} - \hat m_{0000}\hat m_{0101} = \hat m_{0010}\hat m_{1000} - \hat m_{0000}\hat m_{1010} \\
={} & \hat m_{0100}\hat m_{0111}\hat m_{1001}\hat m_{1010} - \hat m_{0101}\hat m_{0110}\hat m_{1000}\hat m_{1011} = 0.
\end{aligned}
\]

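We compute the marginal counts for the cliques in our model using the data
set (4.13) and set these counts equal to the sum of the MLE for the cells
associated with each clique as follows:

\[
\begin{aligned}
\hat m_{00++} &= \hat m_{0000} + \hat m_{0001} + \hat m_{0010} + \hat m_{0011} = 4, \\
\hat m_{01++} &= \hat m_{0100} + \hat m_{0101} + \hat m_{0110} + \hat m_{0111} = 4, \\
\hat m_{10++} &= \hat m_{1000} + \hat m_{1001} + \hat m_{1010} + \hat m_{1011} = 5, \\
\hat m_{+00+} &= \hat m_{0000} + \hat m_{0001} + \hat m_{1000} + \hat m_{1001} = 4, \\
\hat m_{++10} &= \hat m_{0010} + \hat m_{0110} + \hat m_{1010} = 3, \\
\hat m_{++11} &= \hat m_{0011} + \hat m_{0111} + \hat m_{1011} = 4, \\
\hat m_{0++1} &= \hat m_{0001} + \hat m_{0101} + \hat m_{0011} + \hat m_{0111} = 4, \\
\hat m_{1++0} &= \hat m_{1000} + \hat m_{1010} = 2.
\end{aligned}
\]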

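Note that the following linear equations are implied and hence redundant
in our system:

\[
\begin{aligned}
\hat m_{+01+} &= \hat m_{0010} + \hat m_{0011} + \hat m_{1010} + \hat m_{1011} = 5, \\
\hat m_{+10+} &= \hat m_{0100} + \hat m_{0101} = 2, \\
\hat m_{+11+} &= \hat m_{0110} + \hat m_{0111} = 2, \\
\hat m_{++00} &= \hat m_{0000} + \hat m_{0100} + \hat m_{1000} = 3, \\
\hat m_{++01} &= \hat m_{0001} + \hat m_{0101} + \hat m_{1001} = 3, \\
\hat m_{0++0} &= \hat m_{0000} + \hat m_{0100} + \hat m_{0010} + \hat m_{0110} = 4, \\
\hat m_{1++1} &= \hat m_{1001} + \hat m_{1011} = 3.
\end{aligned}
\]

The positive solution to these equations can be found numerically using
iterative proportional scaling:

\[
\begin{pmatrix}
\hat m_{0000} & \hat m_{0001} & \hat m_{0010} & \hat m_{0011} \\
\hat m_{0100} & \hat m_{0101} & \hat m_{0110} & \hat m_{0111} \\
\hat m_{1000} & \hat m_{1001} & \hat m_{1010} & \hat m_{1011}
\end{pmatrix}
=
\begin{pmatrix}
0.96 & 0.83 & 1.03 & 1.18 \\
1.07 & 0.93 & 0.93 & 1.07 \\
0.97 & 1.24 & 1.03 & 1.76
\end{pmatrix}.
\]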
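
The numerical fit can be reproduced with a short iterative proportional scaling loop. This is a generic textbook version of the procedure, not the authors' code; the names `margin` and `iterative_proportional_scaling`, the number of sweeps and the convention that a zero fitted margin stays zero are choices made for this sketch. The data are the 4 x 4 table of counts with rows indexed by $(x_1, x_2)$ and columns by $(x_3, x_4)$, whose last row is zero.

```python
from itertools import product

EDGES = [(0, 1), (1, 2), (2, 3), (0, 3)]  # cliques of the four-cycle, 0-based
STATES = list(product((0, 1), repeat=4))

# Observed counts: rows are (x1, x2) = 00, 01, 10, 11; columns are (x3, x4).
table = [[1, 1, 1, 1],
         [1, 1, 1, 1],
         [1, 1, 1, 2],
         [0, 0, 0, 0]]
counts = {s: table[2 * s[0] + s[1]][2 * s[2] + s[3]] for s in STATES}

def margin(cells, edge):
    """Two-way margin of a table of cell values over the given edge."""
    totals = {}
    for s, value in cells.items():
        key = (s[edge[0]], s[edge[1]])
        totals[key] = totals.get(key, 0.0) + value
    return totals

def iterative_proportional_scaling(counts, sweeps=2000):
    """Cyclically rescale the fit so that each edge margin matches the data."""
    fit = {s: 1.0 for s in STATES}
    targets = {e: margin(counts, e) for e in EDGES}
    for _ in range(sweeps):
        for e in EDGES:
            current = margin(fit, e)
            for s in STATES:
                key = (s[e[0]], s[e[1]])
                if current[key] > 0:
                    fit[s] *= targets[e][key] / current[key]
                else:
                    fit[s] = 0.0  # a zero fitted margin stays zero
    return fit

fit = iterative_proportional_scaling(counts)
for row in ((0, 0), (0, 1), (1, 0)):
    print([round(fit[row + col], 2) for col in product((0, 1), repeat=2)])
```

Every iterate is a product of edge potentials, so the binomial constraints hold throughout and only the linear margin equations need to converge.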

Our main point, however, is to analyze the equations using symbolic algebra
instead of numerical computation. We enter our five binomials and eight
linear equations into the computer algebra system Macaulay 2 by Grayson
and Stillman [16]. To keep the notation simple, we replace the 12
indeterminates $\hat m_{0000}, \hat m_{0001}, \ldots, \hat m_{1011}$ by
a, b, ..., l. The command gb MLE computes the reduced Gröbner basis of our
maximum likelihood equations in lexicographic term order:

    i1 : R = QQ[a,b,c,d,e,f,g,h,i,j,k,l,
                MonomialOrder => Lex];

    i2 : MLE = ideal(c*h-d*g, b*l-d*j, a*k-c*i,
                     a*f-b*e, e*h*j*k-f*g*i*l, a+b+c+d-4,
                     e+f+g+h-4, i+j+k+l-5, a+b+i+j-4,
                     c+g+k-3, d+h+l-4, b+f+d+h-4, i+k-2);

    o2 : Ideal of R

    i3 : gb MLE

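The Gröbner basis consists of twelve polynomials:

\[
\begin{gathered}
\ell^5 - \tfrac{362}{39}\ell^4 + \tfrac{6713}{351}\ell^3 + \tfrac{110}{9}\ell^2 - \tfrac{2368}{39}\ell + \tfrac{480}{13}, \\
k + \tfrac{6539}{22304}\ell^4 - \tfrac{58985}{33456}\ell^3 - \tfrac{513737}{602208}\ell^2 + \tfrac{490447}{100368}\ell - \tfrac{585}{2788}, \\
j + \ell - 3, \quad i + j + k + \ell - 5, \quad h + c_1 k^2 - c_2 k\ell + c_3 k - c_4 \ell^2 + c_5 \ell - c_6, \\
g + h - 2, \quad f - 2h - 2k + 3\ell - 2, \quad e + f + g + h - 4, \quad d + h + \ell - 4, \\
c + g + k - 3, \quad b + f - \ell, \quad a - f - g - h - k + 3,
\end{gathered}
\]

where $c_1, \ldots, c_6$ stand for rational coefficients that are not legible
in this copy.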

The polynomials are in triangularized form; that is, each indeterminate is
expressed in terms of indeterminates which come later in the alphabet. The
only exception is the first equation in the Gröbner basis, which is a
polynomial in the single variable $\ell$ and which we denote by
$\psi(\ell)$. The properties of this polynomial which are relevant for our
discussion are given by the following proposition:

Proposition 3. The polynomial $\psi(\ell)$ is irreducible over the rational
numbers, and its Galois group is the symmetric group on five letters.

Proposition 3 can be established using the computer algebra system Maple
with the galois command. The implications of this proposition are twofold.
First, one cannot always find a rational solution to maximum likelihood for a
nondecomposable model. Second, none of the five real roots of the equation
can be expressed in terms of radicals. Thus, unlike a decomposable model,
the MLE for a nondecomposable model cannot, in general, be given by an
algebraic expression of the cell counts. For a more detailed discussion of
Galois groups and irreducibility the reader can consult a book on Galois
theory such as Stewart [27].
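
Part of Proposition 3 can be cross-checked with exact arithmetic. The sketch below assumes the first Gröbner basis polynomial reads $\ell^5 - \tfrac{362}{39}\ell^4 + \tfrac{6713}{351}\ell^3 + \tfrac{110}{9}\ell^2 - \tfrac{2368}{39}\ell + \tfrac{480}{13}$ and works with $351\,\psi(\ell)$, which has integer coefficients. It applies the rational root test and brackets the root near the reported estimate 1.76. Both are necessary conditions only: they show that $\psi$ has no linear factor over the rationals, not that it is irreducible, and they say nothing about the Galois group.

```python
from fractions import Fraction

# Coefficients of 351 * psi(l), highest degree first (an assumed reading of
# the first Groebner basis element, cleared of denominators).
COEFFS = [351, -3258, 6713, 4290, -21312, 12960]

def evaluate(coeffs, x):
    """Horner evaluation with exact rational arithmetic."""
    value = Fraction(0)
    for c in coeffs:
        value = value * x + c
    return value

def divisors(n):
    n = abs(n)
    return [d for d in range(1, n + 1) if n % d == 0]

# Rational root test: any rational root is +-p/q with p dividing the
# constant term and q dividing the leading coefficient.
candidates = {Fraction(sign * p, q)
              for p in divisors(COEFFS[-1])
              for q in divisors(COEFFS[0])
              for sign in (1, -1)}
rational_roots = sorted(x for x in candidates if evaluate(COEFFS, x) == 0)

# A sign change between 1.75 and 1.77 brackets a real root near 1.76.
sign_change = evaluate(COEFFS, Fraction(7, 4)) < 0 < evaluate(COEFFS, Fraction(177, 100))
print(rational_roots, sign_change)
```

The bracketed root is consistent with the numerical value of the estimate for the cell 1011 obtained by iterative proportional scaling.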

For the full four-cycle model, when none of the table counts is zero, the
degree of the first equation in the Gröbner basis is 13 instead of 5. In
particular, the polynomial of $\hat m_{1111}$ is irreducible of degree 13,
and the other 15 coordinates $\hat m_{ijkl}$ of the maximum likelihood
estimator are expressed as a polynomial with rational coefficients in
$\hat m_{1111}$.

It would be interesting to find a combinatorial formula for the degree 
of the maximum likelihood estimator as a function of the structure of the 
undirected graphical model A(G). A better understanding of this algebraic 
degree is likely to have applications in computational statistics. 

4.7. A characterization of decomposable models. The following theorem 
provides a characterization of decomposable models. 

Theorem 4.4. Let G be an undirected graphical model for discrete vari- 
ables. Then the following five statements are equivalent: 

(a) G is a decomposable graphical model. 

(b) A distribution P factors according to G if and only if P satisfies a 
set of quadratic binomials corresponding to global separation statements in 
G. 

(c) The ideal $I_{A(G)}$ has a quadratic Gröbner basis in which each
polynomial corresponds to a global separation statement in G.

(d) The maximum likelihood estimate for G is a rational function.

(e) The set $\operatorname{image}(\phi_{A(G)})$ is closed.

The fact that (a) implies (c) is Theorem 4.3. As described in Sections 
4.3 and 4.6, it is known that (a) implies (d) and (a) implies (e). Note that 
(c) implies (b) is a trivial implication, so the only thing to prove for the 
above theorem is (b) implies (a), (d) implies (a) and (e) implies (a). We use 
constructions based on examples from previous sections to prove the result. 

The essential idea of the proof is that every nondecomposable model
contains a four-cycle and we prove all these claims by lifting the examples
developed in Sections 4.5 and 4.6 for the four-cycle model to other
nondecomposable models. For these results we use the following
graph-theoretic definitions. A set of vertices A is connected in undirected
graph G if and only if there is a path in G between every pair of vertices
in A that only passes through vertices in A. The sets A, B, C, D, E are a
nondecomposable partition for graph G with vertices X if and only if (1) the
sets A, B, C, D, E are disjoint, (2) $X = A \cup B \cup C \cup D \cup E$,
(3) A, B, C, D are not empty, (4) the subgraph of G over A, B, C, D is a
cycle with no chords and (5) each of the sets A, B, C, D is connected in G.

Proposition 4. If the undirected graph G is nondecomposable, then 
there exists a nondecomposable partition for the graph. 

One can construct a nondecomposable partition for a nondecomposable
graph G with vertex set X as follows. First let $C_1, \ldots, C_n$ be a cycle
in G with no chords and length $n \ge 4$. One nondecomposable partition is
given by the following five sets: $A = \{C_1, \ldots, C_{i-1}\}$,
$B = \{C_i, \ldots, C_{j-1}\}$, $C = \{C_j, \ldots, C_{k-1}\}$,
$D = \{C_k, \ldots, C_n\}$ where $1 < i < j < k \le n$, and
$E = X \setminus \{C_1, \ldots, C_n\}$. We are now prepared to prove the
needed claims.
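
The construction splits a chordless cycle into four consecutive arcs and puts every other vertex into E. The sketch below illustrates this under stated assumptions: the cycle is supplied by the caller and is not checked for chords, the cut points are read as $1 < i < j < k \le n$ with 1-based positions, and the function name and the example graph are hypothetical.

```python
def nondecomposable_partition(cycle, vertices, i, j, k):
    """Split a chordless cycle C_1..C_n (a list) at 1-based cut points
    1 < i < j < k <= n; all vertices off the cycle form E.

    The caller is responsible for passing a cycle without chords.
    """
    n = len(cycle)
    if n < 4 or not (1 < i < j < k <= n):
        raise ValueError("need n >= 4 and 1 < i < j < k <= n")
    A = cycle[:i - 1]        # C_1 .. C_(i-1)
    B = cycle[i - 1:j - 1]   # C_i .. C_(j-1)
    C = cycle[j - 1:k - 1]   # C_j .. C_(k-1)
    D = cycle[k - 1:]        # C_k .. C_n
    on_cycle = set(cycle)
    E = [v for v in vertices if v not in on_cycle]
    return A, B, C, D, E

# Hypothetical graph: a chordless five-cycle c1..c5 plus one extra vertex z.
cycle = ["c1", "c2", "c3", "c4", "c5"]
parts = nondecomposable_partition(cycle, cycle + ["z"], i=2, j=3, k=4)
print(parts)
```

Each arc of a cycle is connected, and contracting the four arcs leaves a chordless four-cycle, which is what conditions (4) and (5) of the definition require.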

Proof of "(b) implies (a)." We use Example 8 to show that the zero
set of $I_{A(G)}$ for an arbitrary nondecomposable model cannot be specified
by quadratic binomials. We exhibit a probability distribution p which is in
the zero set of all quadratic binomials in $I_{A(G)}$ but is not in the zero
set of $I_{A(G)}$.

Let G be a nondecomposable graph and let the sets A, B, C, D, E form a
nondecomposable partition of G. We define a probability distribution p such
that $p_{01001} = p_{01111} = p_{10011} = p_{10101} = 1/4$ where
$p_{ijklm}$ is the probability that each variable in A has value i and each
variable in B has value j and each variable in C has value k and each
variable in D has value l and each variable in E has value m.

Consider the nonquadratic binomial
$p_{01001}p_{01111}p_{10011}p_{10101} - p_{01011}p_{01101}p_{10001}p_{10111}$.
This binomial lies in $I_{A(G)}$ because the intersection of any clique
of G and the set of vertices on the cycle defining the nondecomposable
partition (i.e., $A \cup B \cup C \cup D$) is either the empty set, a
singleton, or a pair of adjacent vertices. Restricting the indices of the two
quartic monomials to any clique gives two identical monomials, which means
that the binomial lies in $I_{A(G)}$.

We claim that every quadratic binomial in $I_{A(G)}$ vanishes at p. Suppose
not. Then there exists a binomial $p_a p_b - p_c p_d$ which lies in
$I_{A(G)}$ and, after relabeling, our probability distribution p satisfies
$p_a = p_b = 1/4$ and $p_c p_d = 0$. Hence a and b are among the four basic
events with positive probability. For any such pair a, b, it is easy to
check that the sum of the two columns of $A(G)$ indexed by a and b cannot be
written in any other way as a sum of columns of $A(G)$. The reason is that a
and b agree in a connected subset of the k-cycle and they also disagree in a
connected subset of the k-cycle. We conclude that every quadratic binomial in
$I_{A(G)}$ vanishes at our probability distribution p, which completes the
proof of the implication from (b) to (a).

Proof of "(d) implies (a)." We lift the example described by (4.13) that
demonstrates the potential nonrationality of the estimates for a four-cycle
(Section 4.6) to an arbitrary nondecomposable graph. Let G be an arbitrary
nondecomposable graph and let the sets A, B, C, D, E form a nondecomposable
partition of G. We define a data set for the variables in G by expanding
the data set defined by (4.13). Let $m_{ijklm}$ denote the count of cases in
which all of the variables in A have the value i and all of the variables in
B have the value j and so on. If we let the data set be

(4.14)

where all of the counts not shown are zero, then the estimate for the cell
corresponding to $\hat m_{1011}$ will be identical to the nonrational maximum
likelihood estimate of $\hat m_{1011}$ from Section 4.6.

Proof of "(e) implies (a)." Let G be a nondecomposable graph and let 
the sets A, B, C, D, E form a nondecomposable partition of G. We construct 
a sequence of distributions that factor according to G but whose limiting dis- 
tribution does not factor. We do so by defining pairwise potential functions 
for each of the edges in a graph G, that is, a log-linear distribution where
every generator is a set of at most two variables. A log-linear distribution of 
this form will also factor according to G because the pairwise potential func- 
tions can be combined into clique potentials. First V^G) = 1- We consider 
pairs of vertices $\{X, Y\}$ connected by an edge in G. If $X \in E$ or
$Y \in E$, then $\psi_{X,Y}(\cdot, \cdot) = 1$. If $\{X, Y\} \subset A$,
$\{X, Y\} \subset B$, $\{X, Y\} \subset C$ or $\{X, Y\} \subset D$, then
we define $\psi_{X,Y}(0, 0) = \psi_{X,Y}(1, 1) = n$ and otherwise
$\psi_{X,Y}(\cdot, \cdot) = 1$. From the
definition of a nondecomposable partition, A is connected to exactly two of
the sets B, C, D. Without loss of generality suppose that A is connected to B
and D. Finally we add the potentials for edges between the sets A, B, C, D. If
IgA and l£B, then %p X y (x, y) = n^ xy - y ^ if x € {0, 1} and y G {0, 1} and 
^x,y{ x iV) = 1 otherwise. If X G B and X G C, then ^x,y(^,y) = n^ xy ~ y ^ if 
x G {0, 1} and y G {0, 1} and ipx,Y(x,y) = 1 otherwise. If X G C and X G D, 
then ip Xt Y(x,y) = rS xy ^ if x G {0, 1} and y G {0, 1} and ipx,Y(. x >v) = 1 °th~ 
erwise. If X G A and X G D, then tpx,Y(x,y) = n^~ xy ^ if x G {0,1} and 
y G {0,1} and ^x,y(x,y) = 1 otherwise. We consider the sequences of dis- 
tributions defined by these pairwise potentials as n — > oo. If we consider the 
four-cycle graph, then the limiting distribution is equal to the distribution 
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given in Example 7 and the distribution does not have ^4-feasible support. In 
the limiting distribution for a general nondecomposable graph G, the vari- 
ables in E are mutually independent and independent of all other variables 
and all of the variables within either A,B,C or D are deterministically 
related. Thus, checking whether the support of the limiting distribution is 
j4-feasible reduces to the problem of checking the support of Example 7. 
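
The four-cycle case of this construction can be checked numerically. The sketch below is only an illustration under stated assumptions: each of the sets $A, B, C, D$ holds a single binary variable, $E$ is empty, and the edge potentials are taken to be $n^{(ab-b)}$, $n^{(bc-c)}$, $n^{(cd)}$ and $n^{(-ad)}$. The function names are hypothetical and do not come from the paper.

```python
from itertools import product

def exponent_of_n(a, b, c, d):
    """Exponent of n in the unnormalized weight of the state (a, b, c, d).

    Assumed edge potentials on the four-cycle A-B-C-D-A:
    n^(ab - b), n^(bc - c), n^(cd), n^(-ad).
    """
    return (a * b - b) + (b * c - c) + (c * d) + (-a * d)

def distribution(n):
    """Normalized product of the four pairwise potentials for a finite n."""
    weights = {s: float(n) ** exponent_of_n(*s)
               for s in product((0, 1), repeat=4)}
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}

def limiting_support():
    """States whose weight carries the largest power of n.

    As n grows, all probability mass concentrates on these states.
    """
    states = list(product((0, 1), repeat=4))
    top = max(exponent_of_n(*s) for s in states)
    return sorted(s for s in states if exponent_of_n(*s) == top)

support = limiting_support()
p = distribution(10**6)
print(support)
print([round(p[s], 6) for s in support])
```

Every state has positive probability for finite $n$, while in the limit the mass is spread evenly over the states returned by `limiting_support()`. Whether that support is $A$-feasible is a separate check that this sketch does not perform.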

APPENDIX: PROOFS OF THEOREMS 3.1 AND 3.2 

We first note via the next lemma that it would be equivalent in the
definition of the nonnegative toric variety $X_A$ to allow
$u_1, \ldots, u_m, v_1, \ldots, v_m$ to be nonnegative real numbers rather
than integers.

Lemma A.1. Let $Z_A$ be the set of all vectors
$(x_1, \ldots, x_m) \in \mathbb{R}^m_{\ge 0}$ which satisfy

(A.1)    $x_1^{u_1} x_2^{u_2} \cdots x_m^{u_m} = x_1^{v_1} x_2^{v_2} \cdots x_m^{v_m}$

whenever $u = (u_1, \ldots, u_m)$ and $v = (v_1, \ldots, v_m)$ are vectors of
nonnegative real numbers which satisfy the $d$ linear relations

(A.2)    $u_1 a_1 + u_2 a_2 + \cdots + u_m a_m = v_1 a_1 + v_2 a_2 + \cdots + v_m a_m$.

Then $Z_A = X_A$.

Proof. Clearly, $Z_A \subseteq X_A$. For the converse, let $x$ be a point in
$X_A$. We need to show that (A.1) with $u, v$ being nonnegative real vectors
holds for the point $x$.

A vector $b$ is sign-compatible with a vector $c$ if every nonzero entry of $b$
agrees in sign with the corresponding entry of $c$. We denote by $c^+$ the
vector whose $j$th entry equals $c_j$ for all nonnegative entries of $c$ and is
zero otherwise. Similarly, we denote by $c^-$ the vector whose $j$th entry
equals $-c_j$ for all negative entries of $c$ and is zero otherwise. Clearly,
$c^+$ and $c^-$ are nonnegative vectors and $c = c^+ - c^-$.

From Lemma 4.10 of [28], there exist integer vectors $w_j$ that are
sign-compatible with $w := u - v$ such that $w = \sum_j \alpha_j w_j$, where
(i) $w_j^+$ and $w_j^-$ satisfy (A.2) and (ii) $\alpha_j > 0$. From (i) and the
definition of $X_A$, we have $x^{w_j^+} = x^{w_j^-}$ for all $x \in X_A$. From
(ii) and the fact that all of the $w_j$ are sign-compatible with $w$, we can
write $w^+ = \sum_j \alpha_j w_j^+$ and $w^- = \sum_j \alpha_j w_j^-$.
Because $0 \le u - w^+ = v - w^-$, the expression
$x^{u - w^+} = x^{v - w^-}$ is well defined and holds for all
$x \in \mathbb{R}^m_{\ge 0}$. Therefore we can validly write
$x^u = x^{w^+} x^{u - w^+}$ and $x^v = x^{w^-} x^{v - w^-}$. As
$w^+ = \sum_j \alpha_j w_j^+$ and $w^- = \sum_j \alpha_j w_j^-$, and (A.1)
holds for each pair $(w_j^+, w_j^-)$, it is straightforward to show that
$x^{w^+} = x^{w^-}$, thus

$x^u = x^v$.    □
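
The sign decomposition and the real-exponent identity of Lemma A.1 can be illustrated on a small case. The following is a minimal sketch, assuming a hypothetical $4 \times 4$ matrix `A` (the independence model of two binary variables, with columns ordered 00, 01, 10, 11) and the monomial map $\phi_A(t)_j = \prod_i t_i^{a_{ij}}$. The helper names are illustrative and are not part of the paper.

```python
import math

# Hypothetical toy matrix A (d = 4 parameters, m = 4 states).
A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

def phi(A, t):
    """Monomial map: coordinate j is prod_i t_i ** a_ij."""
    d, m = len(A), len(A[0])
    return [math.prod(t[i] ** A[i][j] for i in range(d)) for j in range(m)]

def mat_vec(A, u):
    """The vector u_1 a_1 + ... + u_m a_m, where a_j are the columns of A."""
    return [sum(a * uj for a, uj in zip(row, u)) for row in A]

def monomial(x, u):
    """x^u = prod_j x_j ** u_j, with real exponents allowed."""
    return math.prod(xj ** uj for xj, uj in zip(x, u))

def pos_part(c):
    return [max(cj, 0) for cj in c]

def neg_part(c):
    return [max(-cj, 0) for cj in c]

x = phi(A, [0.3, 0.7, 0.4, 0.6])     # a point in the image of phi_A
u = [0.5, 0.0, 0.0, 0.5]             # real, non-integer exponents
v = [0.0, 0.5, 0.5, 0.0]
w = [ui - vi for ui, vi in zip(u, v)]

print(mat_vec(A, u) == mat_vec(A, v))     # u and v satisfy (A.2)
print(pos_part(w), neg_part(w))           # w = w^+ - w^-
print(monomial(x, u), monomial(x, v))     # the two sides of (A.1)
```

Here $w = \tfrac12 (1, -1, -1, 1)$ is a positive multiple of a single integer vector, which is the simplest instance of the decomposition $w = \sum_j \alpha_j w_j$ used in the proof.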
Theorem 3.1. A probability distribution $P$ factors according to $A$ if
and only if $P$ lies in the nonnegative toric variety $X_A$ and the support
of $P$ is $A$-feasible.

Proof. The only-if direction has been proved in Lemmas 1 and 2.

For the if direction, fix any vector $P \in X_A$ whose support
$F = \mathrm{supp}(P)$ is $A$-feasible. We must prove that $P$ lies in the
image of $\phi_A$, or equivalently that the system of (3.1) has a nonnegative
real solution vector $(t_1, \ldots, t_d)$. Note that for the definition of
$X_A$ we use nonnegative real exponents in (A.1) as justified by Lemma A.1.

Consider the following system of equations for the indeterminates
$t_1, \ldots, t_d$:

(A.3)    $\prod_{i=1}^{d} t_i^{a_{ij}} = p_j > 0$    for $j \in F$.

We claim that this system has a solution $(t_1, \ldots, t_d)$ all of whose
coordinates are positive real numbers. Introducing new variables
$\tau_i = \log(t_i)$, our claim is equivalent to the assertion that the
following system of linear equations in $\tau_1, \ldots, \tau_d$ has a
solution:

(A.4)    $\sum_{i=1}^{d} a_{ij} \tau_i = \log(p_j)$    for $j \in F$.

We proceed by contradiction. 

Suppose that the system in (A.4) has no solution. This is a linear system
of $|F|$ equations (over the field of the real numbers) in $d$ variables which
can be written as $By = c$ where $B$ is an $|F| \times d$ matrix, and
$c = (\log(p_j), j \in F)$ is a vector of length $|F|$. Assuming that (A.4)
has no solution means that $c$ is not in the column space of $B$. Thus, there
exists a row vector $q$ of length $|F|$ such that $qB$ is the zero vector but
the inner product between the vectors $q$ and $c$ is not zero.

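We now set $u_j = \max\{0, q_j\}$ and $v_j = \max\{0, -q_j\}$. Then
$q_j = u_j - v_j$ and the identity $\sum_{j \in F} q_j a_{ij} = 0$ for every
$i$ (that is, $qB = 0$) translates into an identity of the form in (A.2)
where $u_j = v_j = 0$ for all indices $j$ not in $F$. It follows from
$\sum_{j \in F} q_j \log(p_j) \ne 0$ that

(A.5)    $\sum_{j \in F} u_j \log(p_j) \ne \sum_{j \in F} v_j \log(p_j)$.

Therefore,

$\prod_{j \in F} p_j^{u_j} \ne \prod_{j \in F} p_j^{v_j}$.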

Consequently, the point $P$ does not satisfy (A.1) as required by all points
on $X_A$. Hence, $P$ cannot lie on the nonnegative toric variety $X_A$,
contrary to our assumption. Therefore the system in (A.4) can be solved for
$\tau_i$, and hence (A.3) has a solution $t_i = \exp(\tau_i)$,
$i \in \{1, \ldots, d\}$, in the positive reals.

The solution $(t_1, \ldots, t_d)$ just obtained is arbitrary at each index $i$
not in $I = \bigcup_{j \in F} \mathrm{supp}(a_j)$ because for each such $i$,
$a_{ij} = 0$ for every $j \in F$. We now set $t_i = 0$ for all
$i \in \{1, \ldots, d\} \setminus I$. Since $F$ is $A$-feasible, for each
$j \notin F$ there exists an $i \in \{1, \ldots, d\} \setminus I$ such that
$a_{ij} > 0$. Hence, $\prod_{i=1}^{d} t_i^{a_{ij}} = 0$ for $j \notin F$.
We conclude that the modified vector $(t_1, \ldots, t_d)$ satisfies (3.1), and
hence $P \in \mathrm{image}(\phi_A)$.    □
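
The constructive part of this proof (solve the linear system (A.4) on the support, exponentiate, then zero out the parameters outside $I$) can be sketched in code. The example below is an illustration only: the matrix `A` is a hypothetical toy model (independence of two binary variables, columns ordered 00, 01, 10, 11), and `solve_any` and `factor` are helper names introduced here, not taken from the paper. The sketch does not test $A$-feasibility; it assumes the input support already has that property.

```python
import math

A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

def phi(A, t):
    """Monomial map: coordinate j is prod_i t_i ** a_ij."""
    d, m = len(A), len(A[0])
    return [math.prod(t[i] ** A[i][j] for i in range(d)) for j in range(m)]

def solve_any(B, c, tol=1e-12):
    """Return one solution y of B y = c (free variables set to 0), or None."""
    rows, cols = len(B), len(B[0])
    M = [[float(v) for v in B[r]] + [float(c[r])] for r in range(rows)]
    pivots, r = [], 0
    for col in range(cols):
        if r == rows:
            break
        p = max(range(r, rows), key=lambda k: abs(M[k][col]))
        if abs(M[p][col]) < tol:
            continue                      # no pivot in this column
        M[r], M[p] = M[p], M[r]
        piv = M[r][col]
        M[r] = [v / piv for v in M[r]]
        for k in range(rows):
            if k != r:
                f = M[k][col]
                M[k] = [vk - f * vr for vk, vr in zip(M[k], M[r])]
        pivots.append((r, col))
        r += 1
    if any(abs(M[k][cols]) > tol for k in range(r, rows)):
        return None                       # inconsistent system
    y = [0.0] * cols
    for pr, pc in pivots:
        y[pc] = M[pr][cols]
    return y

def factor(A, P):
    """Parameters t with phi(A, t) == P, following the proof's three steps."""
    d, m = len(A), len(A[0])
    F = [j for j in range(m) if P[j] > 0]
    B = [[A[i][j] for i in range(d)] for j in F]           # system (A.4)
    tau = solve_any(B, [math.log(P[j]) for j in F])
    if tau is None:
        return None                                        # P is not in X_A
    I = {i for i in range(d) for j in F if A[i][j] > 0}
    return [math.exp(tau[i]) if i in I else 0.0 for i in range(d)]

P = [0.4, 0.6, 0.0, 0.0]
t = factor(A, P)
print(t)
print(phi(A, t))
```

For a full-support vector that violates the binomial $p_{00} p_{11} = p_{01} p_{10}$, for instance `[0.4, 0.1, 0.1, 0.4]`, the linear system is inconsistent and `factor` returns `None`, which mirrors the contradiction step of the proof.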

We fix the $d \times m$ matrix $A$ with columns $a_1, \ldots, a_m$ as before.
A subset $F$ of $\{1, 2, \ldots, m\}$ is said to be facial if there exists a
vector $c$ in $\mathbb{R}^d$ such that

(A.6)    $c^T a_i = 0$ for $i \in F$ and $c^T a_i \ge 1$ for $i \in \{1, \ldots, m\} \setminus F$.

Hence, the vector $c$ is orthogonal to the columns whose index is in $F$ and
is not orthogonal to any other column of $A$. The characteristic vector of $F$
is $(z_1, \ldots, z_m)$ with $z_i = 1$ if $i \in F$ and $z_i = 0$ if
$i \notin F$.

Lemma A.2. For a subset $F$ of $\{1, \ldots, m\}$ and a matrix $A$, the
following statements are equivalent:

(a) $F$ is facial for $A$.

(b) The characteristic vector of $F$ lies in the nonnegative toric variety
$X_A$.

(c) There exists a vector with support $F$ in the nonnegative toric variety
$X_A$.

Proof. Assume that (a) holds. We will first show that no nonzero
nonnegative combination of the $a_i$, $i \notin F$, can be written as a linear
combination of the $a_i$, $i \in F$. Let $c$ satisfy (A.6), and suppose
$\sum_{i \notin F} \alpha_i a_i = \sum_{i \in F} \beta_i a_i$, where
$\alpha_i \ge 0$, $i \notin F$. Then
$0 \le \sum_{i \notin F} \alpha_i \le \sum_{i \notin F} \alpha_i c^T a_i
= c^T \cdot (\sum_{i \notin F} \alpha_i a_i)
= c^T \cdot (\sum_{i \in F} \beta_i a_i) = 0$,
hence $\alpha_i = 0$ for $i \notin F$. Thus, there is no identity
in (A.2) where $\mathrm{supp}(u) \subseteq F$ and $\mathrm{supp}(v)$ has
nonempty intersection with $\{1, \ldots, m\} \setminus F$. Consequently, for
every linear relation in (A.2), either both $\mathrm{supp}(u)$ and
$\mathrm{supp}(v)$ are subsets of $F$ or neither of $\mathrm{supp}(u)$ and
$\mathrm{supp}(v)$ is a subset of $F$. However, this means that the
characteristic vector of the set $F$ satisfies (A.1) [namely, both sides of
(A.1) are 0 or both sides are 1] whenever (A.2) holds. Equivalently, the
characteristic vector of $F$ lies in $X_A$. Hence (a) implies (b).

Clearly, (b) implies (c). It remains to show that (c) implies (a). For this
step we apply Farkas' lemma (linear programming duality); see Corollary
7.1e, Section 7.3, in [25]. Farkas' lemma reads as follows: Let $D$ be a
matrix and $e$ be a vector. Then the system $Dx \le e$ has a solution $x$ if
and only if $ye \ge 0$ for each nonnegative row vector $y$ with $yD = 0$.

We define a matrix $D$ with $m + |F|$ rows as follows: The first $|F|$ rows
are the vectors $-a_i$ for $i \in F$. The next $|F|$ rows are the vectors
$+a_i$ for $i \in F$. The last $m - |F|$ rows are the vectors $-a_i$ for
$i \notin F$. Let $e$ be the column vector with $m + |F|$ coordinates as
follows: The first $|F|$ entries are 0. The next $|F|$ entries are 0. The last
$m - |F|$ entries are $-1$.

Suppose $x_0 \in X_A = Z_A$, $\mathrm{supp}(x_0) = F$. Any nonnegative row
vector $y = (y^{(1)}, y^{(2)}, y^{(3)})$ of respective lengths
$(|F|, |F|, m - |F|)$ satisfying $yD = 0$ must have $y^{(3)} = 0$. Otherwise,
taking $u = (y^{(2)}, 0)$ and $v = (y^{(1)}, y^{(3)})$ would satisfy (A.2) but
contradict (A.1) for $x_0$, since $x_0^u > 0$ whereas $x_0^v = 0$. But then,
when $y^{(3)} = 0$, each nonnegative solution of $yD = 0$ trivially satisfies
$ye \ge 0$, hence by Farkas' lemma, $Dx \le e$ has a solution. Consequently,
(a) holds.    □
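
The objects in Lemma A.2 and its proof are easy to write down for a small matrix. The sketch below assumes a hypothetical toy matrix `A` (independence of two binary variables, columns indexed 0 to 3 for the states 00, 01, 10, 11) and uses illustrative helper names. For this particular matrix the kernel is spanned by $(1, -1, -1, 1)$, so membership in $X_A$ reduces to the single binomial $x_0 x_3 = x_1 x_2$; that reduction is a property of this toy matrix, not a general test.

```python
A = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]

def column(A, j):
    return [row[j] for row in A]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def satisfies_A6(A, F, c):
    """Check (A.6): c.a_j == 0 for j in F and c.a_j >= 1 for j outside F."""
    for j in range(len(A[0])):
        value = dot(c, column(A, j))
        if j in F and value != 0:
            return False
        if j not in F and value < 1:
            return False
    return True

def farkas_system(A, F):
    """Build D and e as in the proof, so that D x <= e encodes (A.6)."""
    m = len(A[0])
    inside = sorted(F)
    outside = [j for j in range(m) if j not in F]
    D = ([[-a for a in column(A, j)] for j in inside]
         + [column(A, j) for j in inside]
         + [[-a for a in column(A, j)] for j in outside])
    e = [0] * len(inside) + [0] * len(inside) + [-1] * len(outside)
    return D, e

def solves(D, e, x):
    return all(dot(row, x) <= ei for row, ei in zip(D, e))

def characteristic_vector(F, m):
    return [1 if j in F else 0 for j in range(m)]

def in_toy_variety(z):
    """Membership test valid for this toy A only: x0 * x3 == x1 * x2."""
    return z[0] * z[3] == z[1] * z[2]

F = {0, 1}
c = [0, 1, 0, 0]                 # candidate certificate for (A.6)
D, e = farkas_system(A, F)
print(satisfies_A6(A, F, c), solves(D, e, c))
print(in_toy_variety(characteristic_vector(F, 4)))
print(in_toy_variety(characteristic_vector({0, 3}, 4)))
```

The set $\{0, 1\}$ passes both the certificate check and the characteristic-vector check, whereas $\{0, 3\}$ fails the latter, in line with the equivalence of (a) and (b).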

Theorem 3.2. A probability distribution $P$ factors according to $A$ or is
the limit of probability distributions that factor according to $A$ if and
only if $P$ lies in the nonnegative toric variety $X_A$.

Proof. The claim is that $X_A = \mathrm{closure}(\mathrm{image}(\phi_A))$. By
Lemma 2, the image of $\phi_A$ lies in $X_A$. The set $X_A$ is closed in
$\mathbb{R}^m$ because it is defined by polynomial equations. Hence the
closure of $\mathrm{image}(\phi_A)$ is contained in $X_A$. Therefore it
suffices to prove that
$X_A \subseteq \mathrm{closure}(\mathrm{image}(\phi_A))$. This is shown by
taking a point $P \in X_A \setminus \mathrm{image}(\phi_A)$ and showing that
$P$ lies in the closure of $\mathrm{image}(\phi_A)$. The argument is composed
of three steps. Given a point
$P \in X_A \setminus \mathrm{image}(\phi_A)$, we first define a sequence of
points $P(\varepsilon)$; we then prove that
$\lim_{\varepsilon \to 0} P(\varepsilon) = P$; and finally, we prove that
$P(\varepsilon) \in \mathrm{image}(\phi_A)$ for all $\varepsilon > 0$.

Let $P \in X_A \setminus \mathrm{image}(\phi_A)$ and $F = \mathrm{supp}(P)$.
In order to define $P(\varepsilon)$, consider the following system of
equations for the indeterminates $t_1, \ldots, t_d$:

(A.7)    $\prod_{i=1}^{d} t_i^{a_{ij}} = p_j > 0$    for $j \in F$.

This system of equations is identical to (A.3). We have shown that it has a
solution $(t_1, \ldots, t_d)$ all of whose coordinates are positive real
numbers. By Lemma A.2, the set $F = \mathrm{supp}(P)$ is facial. We now fix
$c \in \mathbb{R}^d$ so that (A.6) holds. We introduce a positive real
parameter $\varepsilon > 0$, and define the vector

$P(\varepsilon) = \Bigl( \varepsilon^{c^T a_1} \prod_{i=1}^{d} t_i^{a_{i1}},\ \varepsilon^{c^T a_2} \prod_{i=1}^{d} t_i^{a_{i2}},\ \ldots,\ \varepsilon^{c^T a_m} \prod_{i=1}^{d} t_i^{a_{im}} \Bigr)$.

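The condition in (A.6) implies that
$\lim_{\varepsilon \to 0} P(\varepsilon) = P$ because for every $j$ in $F$,
$\varepsilon^{c^T a_j}$ is always 1 and so the $j$th coordinate of
$P(\varepsilon)$ equals $\prod_{i=1}^{d} t_i^{a_{ij}} = p_j$, and for every
$j$ not in $F$, $\varepsilon^{c^T a_j}$ tends to 0 and so the $j$th coordinate
of $P(\varepsilon)$ tends to $0 = p_j$. Finally, note

$\exp(c^T a_j) = \exp(c_1 a_{1j} + c_2 a_{2j} + \cdots + c_d a_{dj})$
$\qquad = \exp(c_1)^{a_{1j}} \exp(c_2)^{a_{2j}} \cdots \exp(c_d)^{a_{dj}}$
$\qquad = \prod_{i=1}^{d} \exp(c_i)^{a_{ij}}$,

where $\exp$ denotes the exponential function with base $\varepsilon$, that
is, $\exp(z) = \varepsilon^z$. Thus, the $j$th coordinate of the vector
$P(\varepsilon)$ equals
$\prod_{i=1}^{d} (\exp(c_i)^{a_{ij}} t_i^{a_{ij}}) = \prod_{i=1}^{d} (\exp(c_i) t_i)^{a_{ij}}$.
Hence $P(\varepsilon)$ is the image of the strictly positive vector
$(\exp(c_1) t_1, \exp(c_2) t_2, \ldots, \exp(c_d) t_d)$ under the map
$\phi_A$. This shows that $P(\varepsilon)$ lies in the image of $\phi_A$ for
all $\varepsilon > 0$.    □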
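
The three steps of this proof can be traced on a very small case. The sketch below assumes a hypothetical $2 \times 2$ matrix `A` with columns $a_1 = (1, 0)$ and $a_2 = (1, 1)$, so that $\phi_A(t_1, t_2) = (t_1, t_1 t_2)$. The point $P = (0, 1)$ is not in the image, since $t_1 = 0$ forces both coordinates to vanish, but its support is facial with the certificate $c = (1, -1)$. All names are illustrative.

```python
import math

A = [[1, 1],
     [0, 1]]           # columns a_1 = (1, 0), a_2 = (1, 1)

def phi(A, t):
    """Monomial map: coordinate j is prod_i t_i ** a_ij."""
    d, m = len(A), len(A[0])
    return [math.prod(t[i] ** A[i][j] for i in range(d)) for j in range(m)]

def dot_col(c, A, j):
    """Inner product of c with column j of A."""
    return sum(c[i] * A[i][j] for i in range(len(A)))

P = [0.0, 1.0]         # target point; its support is the second state only
t = [1.0, 1.0]         # positive solution of (A.7): t_1 * t_2 = p_2 = 1
c = [1.0, -1.0]        # satisfies (A.6): c.a_1 = 1 and c.a_2 = 0

def P_eps(eps):
    """The vector P(eps): coordinate j is eps ** (c.a_j) * prod_i t_i ** a_ij."""
    base = phi(A, t)
    return [eps ** dot_col(c, A, j) * base[j] for j in range(len(base))]

def preimage(eps):
    """Strictly positive parameters (eps ** c_i * t_i) that map to P(eps)."""
    return [eps ** c[i] * t[i] for i in range(len(t))]

for eps in (1e-1, 1e-3, 1e-6):
    print(eps, P_eps(eps), phi(A, preimage(eps)))
```

As $\varepsilon$ shrinks, `P_eps(eps)` approaches $(0, 1)$ while remaining the image of the strictly positive parameter vector `preimage(eps)`, whose second coordinate grows without bound.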